The course project is based on the Home Credit Default Risk (HCDR) Kaggle competition. The goal of this project is to predict whether or not a client will repay a loan.
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either be unable to obtain loans or become victims of untrustworthy lenders.
The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
The HomeCredit_columns_description.csv file acts as a data dictionary.
There are 7 different sources of data (8 files, since application_train and application_test come from the same source):

| name | rows | cols | size |
|---|---:|---:|---:|
| application_train | 307,511 | 122 | 158 MB |
| application_test | 48,744 | 121 | 25 MB |
| bureau | 1,716,428 | 17 | 162 MB |
| bureau_balance | 27,299,925 | 3 | 358 MB |
| credit_card_balance | 3,840,312 | 23 | 405 MB |
| installments_payments | 13,605,401 | 8 | 690 MB |
| previous_application | 1,670,214 | 37 | 386 MB |
| POS_CASH_balance | 10,001,358 | 8 | 375 MB |
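A table like the one above can be produced with a small helper that reports shape and in-memory size for each loaded DataFrame. The sketch below is illustrative: the `summarize` helper and the tiny stand-in frames are not from the original notebook, and real in-memory sizes will differ from on-disk CSV sizes.

```python
import pandas as pd

def summarize(datasets):
    """Return 'name: [rows, cols] size-in-MB' lines for a dict of DataFrames."""
    lines = []
    for name, df in datasets.items():
        mb = df.memory_usage(deep=True).sum() / 1024**2
        lines.append(f"{name:25s}: [{df.shape[0]:>10,}, {df.shape[1]:>4}]  {mb:6.1f} MB")
    return lines

# Toy stand-ins for the real HCDR tables (hypothetical data)
demo = {
    "application_train": pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]}),
    "bureau": pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "AMT_CREDIT_SUM": [100.0, 50.0, 75.0]}),
}
for line in summarize(demo):
    print(line)
```

Running the same helper over the actual `datasets` dict (loaded below) would reproduce the row and column counts in the table.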
import os
import zipfile
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler

warnings.filterwarnings('ignore')
# Load every CSV in the HCDR directory into a dict keyed by file name
files = ["HCDR/" + f for f in os.listdir("HCDR")]
datasets = {}
for f in files:
    name = f.split("/")[1].split(".")[0]
    print(f"Loading {name}")
    datasets[name] = pd.read_csv(f, encoding='latin-1')
print()
print(datasets.keys())
Loading credit_card_balance
Loading installments_payments
Loading bureau_balance
Loading application_train
Loading POS_CASH_balance
Loading application_test
Loading bureau
Loading previous_application

dict_keys(['credit_card_balance', 'installments_payments', 'bureau_balance', 'application_train', 'POS_CASH_balance', 'application_test', 'bureau', 'previous_application'])
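The imports above anticipate a baseline model built from an imputation, scaling, and classification Pipeline. As a minimal sketch of that wiring (run here on synthetic data, not the actual HCDR tables, so the feature names and accuracy are illustrative only):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a numeric slice of application_train
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)         # toy target
X[rng.rand(200, 4) < 0.1] = np.nan    # inject missing values, as in the real data

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaNs with column medians
    ("scale", StandardScaler()),                   # standardize features
    ("clf", LogisticRegression(max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
pipe.fit(X_tr, y_tr)
print(f"held-out accuracy: {pipe.score(X_te, y_te):.3f}")
```

The same pipeline shape extends to the real data by adding a OneHotEncoder branch for the categorical columns (e.g. via FeatureUnion or ColumnTransformer).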
| Table | Row | Description | Special | |
|---|---|---|---|---|
| application_{train | test}.csv | SK_ID_CURR | ID of loan in our sample | |
| application_{train | test}.csv | TARGET | Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases) | |
| application_{train | test}.csv | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | |
| application_{train | test}.csv | CODE_GENDER | Gender of the client | |
| application_{train | test}.csv | FLAG_OWN_CAR | Flag if the client owns a car | |
| application_{train | test}.csv | FLAG_OWN_REALTY | Flag if client owns a house or flat | |
| application_{train | test}.csv | CNT_CHILDREN | Number of children the client has | |
| application_{train | test}.csv | AMT_INCOME_TOTAL | Income of the client | |
| application_{train | test}.csv | AMT_CREDIT | Credit amount of the loan | |
| application_{train | test}.csv | AMT_ANNUITY | Loan annuity | |
| application_{train | test}.csv | AMT_GOODS_PRICE | For consumer loans it is the price of the goods for which the loan is given | |
| application_{train | test}.csv | NAME_TYPE_SUITE | Who was accompanying client when he was applying for the loan | |
| application_{train | test}.csv | NAME_INCOME_TYPE | Clients income type (businessman, working, maternity leave,…) | |
| application_{train | test}.csv | NAME_EDUCATION_TYPE | Level of highest education the client achieved | |
| application_{train | test}.csv | NAME_FAMILY_STATUS | Family status of the client | |
| application_{train | test}.csv | NAME_HOUSING_TYPE | What is the housing situation of the client (renting, living with parents, ...) | |
| application_{train | test}.csv | REGION_POPULATION_RELATIVE | Normalized population of region where client lives (higher number means the client lives in more populated region) | normalized |
| application_{train | test}.csv | DAYS_BIRTH | Client's age in days at the time of application | time only relative to the application |
| application_{train | test}.csv | DAYS_EMPLOYED | How many days before the application the person started current employment | time only relative to the application |
| application_{train | test}.csv | DAYS_REGISTRATION | How many days before the application did client change his registration | time only relative to the application |
| application_{train | test}.csv | DAYS_ID_PUBLISH | How many days before the application did client change the identity document with which he applied for the loan | time only relative to the application |
| application_{train | test}.csv | OWN_CAR_AGE | Age of client's car | |
| application_{train | test}.csv | FLAG_MOBIL | Did client provide mobile phone (1=YES, 0=NO) | |
| application_{train | test}.csv | FLAG_EMP_PHONE | Did client provide work phone (1=YES, 0=NO) | |
| application_{train | test}.csv | FLAG_WORK_PHONE | Did client provide home phone (1=YES, 0=NO) | |
| application_{train | test}.csv | FLAG_CONT_MOBILE | Was mobile phone reachable (1=YES, 0=NO) | |
| application_{train | test}.csv | FLAG_PHONE | Did client provide home phone (1=YES, 0=NO) | |
| application_{train | test}.csv | FLAG_EMAIL | Did client provide email (1=YES, 0=NO) | |
| application_{train | test}.csv | OCCUPATION_TYPE | What kind of occupation does the client have | |
| application_{train | test}.csv | CNT_FAM_MEMBERS | How many family members does client have | |
| application_{train | test}.csv | REGION_RATING_CLIENT | Our rating of the region where client lives (1,2,3) | |
| application_{train | test}.csv | REGION_RATING_CLIENT_W_CITY | Our rating of the region where client lives with taking city into account (1,2,3) | |
| application_{train | test}.csv | WEEKDAY_APPR_PROCESS_START | On which day of the week did the client apply for the loan | |
| application_{train | test}.csv | HOUR_APPR_PROCESS_START | Approximately at what hour did the client apply for the loan | rounded |
| application_{train | test}.csv | REG_REGION_NOT_LIVE_REGION | Flag if client's permanent address does not match contact address (1=different, 0=same, at region level) | |
| application_{train | test}.csv | REG_REGION_NOT_WORK_REGION | Flag if client's permanent address does not match work address (1=different, 0=same, at region level) | |
| application_{train | test}.csv | LIVE_REGION_NOT_WORK_REGION | Flag if client's contact address does not match work address (1=different, 0=same, at region level) | |
| application_{train | test}.csv | REG_CITY_NOT_LIVE_CITY | Flag if client's permanent address does not match contact address (1=different, 0=same, at city level) | |
| application_{train | test}.csv | REG_CITY_NOT_WORK_CITY | Flag if client's permanent address does not match work address (1=different, 0=same, at city level) | |
| application_{train | test}.csv | LIVE_CITY_NOT_WORK_CITY | Flag if client's contact address does not match work address (1=different, 0=same, at city level) | |
| application_{train | test}.csv | ORGANIZATION_TYPE | Type of organization where client works | |
| application_{train | test}.csv | EXT_SOURCE_1 | Normalized score from external data source | normalized |
| application_{train | test}.csv | EXT_SOURCE_2 | Normalized score from external data source | normalized |
| application_{train | test}.csv | EXT_SOURCE_3 | Normalized score from external data source | normalized |
| application_{train | test}.csv | APARTMENTS_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | BASEMENTAREA_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | YEARS_BEGINEXPLUATATION_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | YEARS_BUILD_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | COMMONAREA_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | ELEVATORS_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | ENTRANCES_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | FLOORSMAX_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | FLOORSMIN_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | LANDAREA_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | LIVINGAPARTMENTS_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | LIVINGAREA_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | NONLIVINGAPARTMENTS_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | NONLIVINGAREA_AVG | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | APARTMENTS_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | BASEMENTAREA_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | YEARS_BEGINEXPLUATATION_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | YEARS_BUILD_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | COMMONAREA_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | ELEVATORS_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | ENTRANCES_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | FLOORSMAX_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | FLOORSMIN_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | LANDAREA_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | LIVINGAPARTMENTS_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | LIVINGAREA_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | NONLIVINGAPARTMENTS_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | NONLIVINGAREA_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | APARTMENTS_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | BASEMENTAREA_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | YEARS_BEGINEXPLUATATION_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | YEARS_BUILD_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | COMMONAREA_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | ELEVATORS_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | ENTRANCES_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | FLOORSMAX_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | FLOORSMIN_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | LANDAREA_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | LIVINGAPARTMENTS_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | LIVINGAREA_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | NONLIVINGAPARTMENTS_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | NONLIVINGAREA_MEDI | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | FONDKAPREMONT_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | HOUSETYPE_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | TOTALAREA_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | WALLSMATERIAL_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | EMERGENCYSTATE_MODE | Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor | normalized |
| application_{train | test}.csv | OBS_30_CNT_SOCIAL_CIRCLE | How many observation of client's social surroundings with observable 30 DPD (days past due) default | |
| application_{train | test}.csv | DEF_30_CNT_SOCIAL_CIRCLE | How many observation of client's social surroundings defaulted on 30 DPD (days past due) | |
| application_{train | test}.csv | OBS_60_CNT_SOCIAL_CIRCLE | How many observation of client's social surroundings with observable 60 DPD (days past due) default | |
| application_{train | test}.csv | DEF_60_CNT_SOCIAL_CIRCLE | How many observation of client's social surroundings defaulted on 60 (days past due) DPD | |
| application_{train | test}.csv | DAYS_LAST_PHONE_CHANGE | How many days before application did client change phone | |
| application_{train | test}.csv | FLAG_DOCUMENT_2 | Did client provide document 2 | |
| application_{train | test}.csv | FLAG_DOCUMENT_3 | Did client provide document 3 | |
| application_{train | test}.csv | FLAG_DOCUMENT_4 | Did client provide document 4 | |
| application_{train | test}.csv | FLAG_DOCUMENT_5 | Did client provide document 5 | |
| application_{train | test}.csv | FLAG_DOCUMENT_6 | Did client provide document 6 | |
| application_{train | test}.csv | FLAG_DOCUMENT_7 | Did client provide document 7 | |
| application_{train | test}.csv | FLAG_DOCUMENT_8 | Did client provide document 8 | |
| application_{train | test}.csv | FLAG_DOCUMENT_9 | Did client provide document 9 | |
| application_{train | test}.csv | FLAG_DOCUMENT_10 | Did client provide document 10 | |
| application_{train | test}.csv | FLAG_DOCUMENT_11 | Did client provide document 11 | |
| application_{train | test}.csv | FLAG_DOCUMENT_12 | Did client provide document 12 | |
| application_{train | test}.csv | FLAG_DOCUMENT_13 | Did client provide document 13 | |
| application_{train | test}.csv | FLAG_DOCUMENT_14 | Did client provide document 14 | |
| application_{train | test}.csv | FLAG_DOCUMENT_15 | Did client provide document 15 | |
| application_{train | test}.csv | FLAG_DOCUMENT_16 | Did client provide document 16 | |
| application_{train | test}.csv | FLAG_DOCUMENT_17 | Did client provide document 17 | |
| application_{train | test}.csv | FLAG_DOCUMENT_18 | Did client provide document 18 | |
| application_{train | test}.csv | FLAG_DOCUMENT_19 | Did client provide document 19 | |
| application_{train | test}.csv | FLAG_DOCUMENT_20 | Did client provide document 20 | |
| application_{train | test}.csv | FLAG_DOCUMENT_21 | Did client provide document 21 | |
| application_{train | test}.csv | AMT_REQ_CREDIT_BUREAU_HOUR | Number of enquiries to Credit Bureau about the client one hour before application | |
| application_{train | test}.csv | AMT_REQ_CREDIT_BUREAU_DAY | Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application) | |
| application_{train | test}.csv | AMT_REQ_CREDIT_BUREAU_WEEK | Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application) | |
| application_{train | test}.csv | AMT_REQ_CREDIT_BUREAU_MON | Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application) | |
| application_{train | test}.csv | AMT_REQ_CREDIT_BUREAU_QRT | Number of enquiries to Credit Bureau about the client 3 month before application (excluding one month before application) | |
| application_{train | test}.csv | AMT_REQ_CREDIT_BUREAU_YEAR | Number of enquiries to Credit Bureau about the client one day year (excluding last 3 months before application) | |
| bureau.csv | SK_ID_CURR | ID of loan in our sample - one loan in our sample can have 0,1,2 or more related previous credits in credit bureau | hashed | |
| bureau.csv | SK_BUREAU_ID | Recoded ID of previous Credit Bureau credit related to our loan (unique coding for each loan application) | hashed | |
| bureau.csv | CREDIT_ACTIVE | Status of the Credit Bureau (CB) reported credits | ||
| bureau.csv | CREDIT_CURRENCY | Recoded currency of the Credit Bureau credit | recoded | |
| bureau.csv | DAYS_CREDIT | How many days before current application did client apply for Credit Bureau credit | time only relative to the application | |
| bureau.csv | CREDIT_DAY_OVERDUE | Number of days past due on CB credit at the time of application for related loan in our sample | ||
| bureau.csv | DAYS_CREDIT_ENDDATE | Remaining duration of CB credit (in days) at the time of application in Home Credit | time only relative to the application | |
| bureau.csv | DAYS_ENDDATE_FACT | Days since CB credit ended at the time of application in Home Credit (only for closed credit) | time only relative to the application | |
| bureau.csv | AMT_CREDIT_MAX_OVERDUE | Maximal amount overdue on the Credit Bureau credit so far (at application date of loan in our sample) | ||
| bureau.csv | CNT_CREDIT_PROLONG | How many times was the Credit Bureau credit prolonged | ||
| bureau.csv | AMT_CREDIT_SUM | Current credit amount for the Credit Bureau credit | ||
| bureau.csv | AMT_CREDIT_SUM_DEBT | Current debt on Credit Bureau credit | ||
| bureau.csv | AMT_CREDIT_SUM_LIMIT | Current credit limit of credit card reported in Credit Bureau | ||
| bureau.csv | AMT_CREDIT_SUM_OVERDUE | Current amount overdue on Credit Bureau credit | ||
| bureau.csv | CREDIT_TYPE | Type of Credit Bureau credit (Car, cash,...) | ||
| bureau.csv | DAYS_CREDIT_UPDATE | How many days before loan application did last information about the Credit Bureau credit come | time only relative to the application | |
| bureau.csv | AMT_ANNUITY | Annuity of the Credit Bureau credit | ||
| bureau_balance.csv | SK_BUREAU_ID | Recoded ID of Credit Bureau credit (unique coding for each application) - use this to join to CREDIT_BUREAU table | hashed | |
| bureau_balance.csv | MONTHS_BALANCE | Month of balance relative to application date (-1 means the freshest balance date) | time only relative to the application | |
| bureau_balance.csv | STATUS | Status of Credit Bureau loan during the month (active, closed, DPD 0-30, … [C means closed, X means status unknown, 0 means no DPD, 1 means maximal DPD during the month was 1-30, 2 means DPD 31-60, … 5 means DPD 120+ or sold or written off]) | ||
| POS_CASH_balance.csv | SK_ID_PREV | ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit) | ||
| POS_CASH_balance.csv | SK_ID_CURR | ID of loan in our sample | ||
| POS_CASH_balance.csv | MONTHS_BALANCE | Month of balance relative to application date (-1 means the freshest monthly snapshot, 0 means the snapshot at application time - often the same as -1, as many banks do not update the information regularly) | time only relative to the application | |
| POS_CASH_balance.csv | CNT_INSTALMENT | Term of previous credit (can change over time) | ||
| POS_CASH_balance.csv | CNT_INSTALMENT_FUTURE | Installments left to pay on the previous credit | ||
| POS_CASH_balance.csv | NAME_CONTRACT_STATUS | Contract status during the month | ||
| POS_CASH_balance.csv | SK_DPD | DPD (days past due) during the month of previous credit | ||
| POS_CASH_balance.csv | SK_DPD_DEF | DPD during the month with tolerance (debts with low loan amounts are ignored) of the previous credit | ||
| credit_card_balance.csv | SK_ID_PREV | ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit) | hashed | |
| credit_card_balance.csv | SK_ID_CURR | ID of loan in our sample | hashed | |
| credit_card_balance.csv | MONTHS_BALANCE | Month of balance relative to application date (-1 means the freshest balance date) | time only relative to the application | |
| credit_card_balance.csv | AMT_BALANCE | Balance during the month of previous credit | ||
| credit_card_balance.csv | AMT_CREDIT_LIMIT_ACTUAL | Credit card limit during the month of the previous credit | ||
| credit_card_balance.csv | AMT_DRAWINGS_ATM_CURRENT | Amount drawing at ATM during the month of the previous credit | ||
| credit_card_balance.csv | AMT_DRAWINGS_CURRENT | Amount drawing during the month of the previous credit | ||
| credit_card_balance.csv | AMT_DRAWINGS_OTHER_CURRENT | Amount of other drawings during the month of the previous credit | ||
| credit_card_balance.csv | AMT_DRAWINGS_POS_CURRENT | Amount drawing or buying goods during the month of the previous credit | ||
| credit_card_balance.csv | AMT_INST_MIN_REGULARITY | Minimal installment for this month of the previous credit | ||
| credit_card_balance.csv | AMT_PAYMENT_CURRENT | How much did the client pay during the month on the previous credit | ||
| credit_card_balance.csv | AMT_PAYMENT_TOTAL_CURRENT | How much did the client pay during the month in total on the previous credit | ||
| credit_card_balance.csv | AMT_RECEIVABLE_PRINCIPAL | Amount receivable for principal on the previous credit | ||
| credit_card_balance.csv | AMT_RECIVABLE | Amount receivable on the previous credit | ||
| credit_card_balance.csv | AMT_TOTAL_RECEIVABLE | Total amount receivable on the previous credit | ||
| credit_card_balance.csv | CNT_DRAWINGS_ATM_CURRENT | Number of drawings at ATM during this month on the previous credit | ||
| credit_card_balance.csv | CNT_DRAWINGS_CURRENT | Number of drawings during this month on the previous credit | ||
| credit_card_balance.csv | CNT_DRAWINGS_OTHER_CURRENT | Number of other drawings during this month on the previous credit | ||
| credit_card_balance.csv | CNT_DRAWINGS_POS_CURRENT | Number of drawings for goods during this month on the previous credit | ||
| credit_card_balance.csv | CNT_INSTALMENT_MATURE_CUM | Number of paid installments on the previous credit | ||
| credit_card_balance.csv | NAME_CONTRACT_STATUS | Contract status (active signed,...) on the previous credit | ||
| credit_card_balance.csv | SK_DPD | DPD (Days past due) during the month on the previous credit | ||
| credit_card_balance.csv | SK_DPD_DEF | DPD (Days past due) during the month with tolerance (debts with low loan amounts are ignored) of the previous credit | ||
| previous_application.csv | SK_ID_PREV | ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0, 1, 2 or more previous loan applications in Home Credit; a previous application may, but need not, have led to a credit) | hashed | |
| previous_application.csv | SK_ID_CURR | ID of loan in our sample | hashed | |
| previous_application.csv | NAME_CONTRACT_TYPE | Contract product type (Cash loan, consumer loan [POS] ,...) of the previous application | ||
| previous_application.csv | AMT_ANNUITY | Annuity of previous application | ||
| previous_application.csv | AMT_APPLICATION | For how much credit did client ask on the previous application | ||
| previous_application.csv | AMT_CREDIT | Final credit amount on the previous application. This differs from AMT_APPLICATION: AMT_APPLICATION is the amount for which the client initially applied, but during the approval process they could have been granted a different amount (AMT_CREDIT) | ||
| previous_application.csv | AMT_DOWN_PAYMENT | Down payment on the previous application | ||
| previous_application.csv | AMT_GOODS_PRICE | Goods price of good that client asked for (if applicable) on the previous application | ||
| previous_application.csv | WEEKDAY_APPR_PROCESS_START | On which day of the week did the client apply for previous application | ||
| previous_application.csv | HOUR_APPR_PROCESS_START | At approximately what hour of the day did the client apply for the previous application | rounded | |
| previous_application.csv | FLAG_LAST_APPL_PER_CONTRACT | Flag if it was last application for the previous contract. Sometimes by mistake of client or our clerk there could be more applications for one single contract | ||
| previous_application.csv | NFLAG_LAST_APPL_IN_DAY | Flag if the application was the last application per day of the client. Sometimes clients apply for more applications a day. Rarely it could also be error in our system that one application is in the database twice | ||
| previous_application.csv | NFLAG_MICRO_CASH | Flag indicating a microfinance loan | ||
| previous_application.csv | RATE_DOWN_PAYMENT | Down payment rate normalized on previous credit | normalized | |
| previous_application.csv | RATE_INTEREST_PRIMARY | Interest rate normalized on previous credit | normalized | |
| previous_application.csv | RATE_INTEREST_PRIVILEGED | Interest rate normalized on previous credit | normalized | |
| previous_application.csv | NAME_CASH_LOAN_PURPOSE | Purpose of the cash loan | ||
| previous_application.csv | NAME_CONTRACT_STATUS | Contract status (approved, cancelled, ...) of previous application | ||
| previous_application.csv | DAYS_DECISION | Relative to current application when was the decision about previous application made | time only relative to the application | |
| previous_application.csv | NAME_PAYMENT_TYPE | Payment method that client chose to pay for the previous application | ||
| previous_application.csv | CODE_REJECT_REASON | Why was the previous application rejected | ||
| previous_application.csv | NAME_TYPE_SUITE | Who accompanied client when applying for the previous application | ||
| previous_application.csv | NAME_CLIENT_TYPE | Was the client an existing or a new client when applying for the previous application | ||
| previous_application.csv | NAME_GOODS_CATEGORY | What kind of goods did the client apply for in the previous application | ||
| previous_application.csv | NAME_PORTFOLIO | Was the previous application for CASH, POS, CAR, … | ||
| previous_application.csv | NAME_PRODUCT_TYPE | Was the previous application x-sell or walk-in | ||
| previous_application.csv | CHANNEL_TYPE | Through which channel we acquired the client on the previous application | ||
| previous_application.csv | SELLERPLACE_AREA | Selling area of seller place of the previous application | ||
| previous_application.csv | NAME_SELLER_INDUSTRY | The industry of the seller | ||
| previous_application.csv | CNT_PAYMENT | Term of previous credit at application of the previous application | ||
| previous_application.csv | NAME_YIELD_GROUP | Grouped interest rate into small medium and high of the previous application | grouped | |
| previous_application.csv | PRODUCT_COMBINATION | Detailed product combination of the previous application | ||
| previous_application.csv | DAYS_FIRST_DRAWING | Relative to application date of current application when was the first disbursement of the previous application | time only relative to the application | |
| previous_application.csv | DAYS_FIRST_DUE | Relative to application date of current application when was the first due supposed to be of the previous application | time only relative to the application | |
| previous_application.csv | DAYS_LAST_DUE_1ST_VERSION | Relative to application date of current application when was the first due of the previous application | time only relative to the application | |
| previous_application.csv | DAYS_LAST_DUE | Relative to application date of current application when was the last due date of the previous application | time only relative to the application | |
| previous_application.csv | DAYS_TERMINATION | Relative to application date of current application when was the expected termination of the previous application | time only relative to the application | |
| previous_application.csv | NFLAG_INSURED_ON_APPROVAL | Did the client request insurance during the previous application | ||
| installments_payments.csv | SK_ID_PREV | ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit) | hashed | |
| installments_payments.csv | SK_ID_CURR | ID of loan in our sample | hashed | |
| installments_payments.csv | NUM_INSTALMENT_VERSION | Version of installment calendar (0 is for credit card) of previous credit. Change of installment version from month to month signifies that some parameter of payment calendar has changed | ||
| installments_payments.csv | NUM_INSTALMENT_NUMBER | On which installment we observe payment | ||
| installments_payments.csv | DAYS_INSTALMENT | When the installment of previous credit was supposed to be paid (relative to application date of current loan) | time only relative to the application | |
| installments_payments.csv | DAYS_ENTRY_PAYMENT | When was the installment of previous credit actually paid (relative to application date of current loan) | time only relative to the application | |
| installments_payments.csv | AMT_INSTALMENT | What was the prescribed installment amount of previous credit on this installment | ||
| installments_payments.csv | AMT_PAYMENT | What the client actually paid on previous credit on this installment | ||
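The dictionary above describes a relational schema: the secondary tables link back to the main application table through `SK_ID_CURR`, and one current application can have zero, one, or many previous credits (`SK_ID_PREV`). A minimal sketch with hypothetical toy values illustrates this one-to-many relationship:

```python
import pandas as pd

# Toy data (hypothetical IDs and amounts) illustrating the one-to-many
# relationship: one current application (SK_ID_CURR) maps to 0, 1, or many
# previous credits (SK_ID_PREV) in tables such as previous_application.csv.
current_apps = pd.DataFrame({"SK_ID_CURR": [100001, 100002, 100003]})
prev_apps = pd.DataFrame({
    "SK_ID_PREV": [2000001, 2000002, 2000003],
    "SK_ID_CURR": [100001, 100001, 100003],  # 100002 has no previous credits
    "AMT_APPLICATION": [17145.0, 607500.0, 112500.0],
})

# Count previous applications per current loan; clients absent from
# prev_apps get a count of 0 after reindexing.
counts = (prev_apps.groupby("SK_ID_CURR").size()
          .reindex(current_apps["SK_ID_CURR"], fill_value=0))
print(counts.tolist())  # → [2, 0, 1]
```

This shape is why the feature pipelines below aggregate each secondary table to one row per `SK_ID_CURR` before joining it back to the applications.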
appsDF = datasets["previous_application"]
display(appsDF.head())
print(f"{appsDF.shape[0]:,} rows, {appsDF.shape[1]:,} columns")
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
1,670,214 rows, 37 columns
# Create aggregate features (via pipeline)
class prevAppsFeaturesAggregater(BaseEstimator, TransformerMixin):
    """Aggregate previous-application features per client (min/max/mean)."""
    def __init__(self, features=None):  # no *args or **kwargs (sklearn convention)
        self.features = features
        self.agg_op_features = {f: ["min", "max", "mean"] for f in features}
    def fit(self, X, y=None):
        return self  # stateless transformer: nothing to learn
    def transform(self, X, y=None):
        result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
        # Flatten the (feature, statistic) MultiIndex into names like "AMT_ANNUITY_min"
        result.columns = ["_".join(x) for x in result.columns.ravel()]
        # Derived feature: spread between largest and smallest requested credit
        if "AMT_APPLICATION" in self.features:
            result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
        # Return a dataframe that keeps the join key "SK_ID_CURR" as a column
        return result.reset_index(level=["SK_ID_CURR"])
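The core of the transformer is the groupby-aggregate step followed by flattening the resulting column MultiIndex. A self-contained sketch of just that step, on a hypothetical two-client frame:

```python
import pandas as pd

# Hypothetical mini previous_application table: two clients, three rows
df = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100002],
    "AMT_ANNUITY": [1000.0, 3000.0, 2000.0],
})

# groupby().agg() with a list of statistics yields (feature, stat) MultiIndex columns
agg = df.groupby("SK_ID_CURR").agg({"AMT_ANNUITY": ["min", "max", "mean"]})
agg.columns = ["_".join(col) for col in agg.columns]  # flatten, e.g. "AMT_ANNUITY_min"
agg = agg.reset_index()  # keep the join key as a regular column
print(agg.columns.tolist())
# → ['SK_ID_CURR', 'AMT_ANNUITY_min', 'AMT_ANNUITY_max', 'AMT_ANNUITY_mean']
```

After this step each client occupies exactly one row, so the result can be left-merged into the main application table on `SK_ID_CURR`.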
from sklearn.pipeline import make_pipeline

def test_driver_prevAppsFeaturesAggregater(df, features):
    print(f"df.shape: {df.shape}\n")
    print(f"df[{features}][0:5]: \n{df[features][0:5]}")
    test_pipeline = make_pipeline(prevAppsFeaturesAggregater(features))
    return test_pipeline.fit_transform(df)
# Full candidate feature list (note: a categorical column such as NAME_PAYMENT_TYPE
# would need count/mode aggregations rather than min/max/mean):
# features = ['AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT',
#             'AMT_GOODS_PRICE', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
#             'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'CNT_PAYMENT',
#             'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
#             'DAYS_LAST_DUE', 'DAYS_TERMINATION']
# Start with a small subset for a quick smoke test
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
print(f"Test driver: \n{res[0:10]}")
print(f"input[features][0:10]: \n{appsDF[0:10]}")
df.shape: (1670214, 37)

df[['AMT_ANNUITY', 'AMT_APPLICATION']][0:5]: 
   AMT_ANNUITY  AMT_APPLICATION
0     1730.430          17145.0
1    25188.615         607500.0
2    15060.735         112500.0
3    47041.335         450000.0
4    31924.395         337500.0
Test driver: 
   SK_ID_CURR  AMT_ANNUITY_min  AMT_ANNUITY_max  AMT_ANNUITY_mean  \
0      100001         3951.000         3951.000       3951.000000
1      100002         9251.775         9251.775       9251.775000
2      100003         6737.310        98356.995      56553.990000
3      100004         5357.250         5357.250       5357.250000
4      100005         4813.200         4813.200       4813.200000
5      100006         2482.920        39954.510      23651.175000
6      100007         1834.290        22678.785      12278.805000
7      100008         8019.090        25309.575      15839.696250
8      100009         7435.845        17341.605      10051.412143
9      100010        27463.410        27463.410      27463.410000

   AMT_APPLICATION_min  AMT_APPLICATION_max  AMT_APPLICATION_mean  \
0              24835.5              24835.5          24835.500000
1             179055.0             179055.0         179055.000000
2              68809.5             900000.0         435436.500000
3              24282.0              24282.0          24282.000000
4                  0.0              44617.5          22308.750000
5                  0.0             688500.0         272203.260000
6              17176.5             247500.0         150530.250000
7                  0.0             450000.0         155701.800000
8              40455.0             110160.0          76741.714286
9             247212.0             247212.0         247212.000000

   range_AMT_APPLICATION
0                    0.0
1                    0.0
2               831190.5
3                    0.0
4                44617.5
5               688500.0
6               230323.5
7               450000.0
8                69705.0
9                    0.0

input[features][0:10]: 
   SK_ID_PREV  SK_ID_CURR NAME_CONTRACT_TYPE  AMT_ANNUITY  AMT_APPLICATION  \
0     2030495      271877     Consumer loans     1730.430          17145.0
1     2802425      108129         Cash loans    25188.615         607500.0
2     2523466      122040         Cash loans    15060.735         112500.0
3     2819243      176158         Cash loans    47041.335         450000.0
4     1784265      202054         Cash loans    31924.395         337500.0
5     1383531      199383         Cash loans    23703.930         315000.0
6     2315218      175704         Cash loans          NaN              0.0
7     1656711      296299         Cash loans          NaN              0.0
8     2367563      342292         Cash loans          NaN              0.0
9     2579447      334349         Cash loans          NaN              0.0

   AMT_CREDIT  AMT_DOWN_PAYMENT  AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START  \
0     17145.0               0.0          17145.0                   SATURDAY
1    679671.0               NaN         607500.0                   THURSDAY
2    136444.5               NaN         112500.0                    TUESDAY
3    470790.0               NaN         450000.0                     MONDAY
4    404055.0               NaN         337500.0                   THURSDAY
5    340573.5               NaN         315000.0                   SATURDAY
6         0.0               NaN              NaN                    TUESDAY
7         0.0               NaN              NaN                     MONDAY
8         0.0               NaN              NaN                     MONDAY
9         0.0               NaN              NaN                   SATURDAY

   HOUR_APPR_PROCESS_START  ... NAME_SELLER_INDUSTRY  CNT_PAYMENT  \
0                       15  ...         Connectivity         12.0
1                       11  ...                  XNA         36.0
2                       11  ...                  XNA         12.0
3                        7  ...                  XNA         12.0
4                        9  ...                  XNA         24.0
5                        8  ...                  XNA         18.0
6                       11  ...                  XNA          NaN
7                        7  ...                  XNA          NaN
8                       15  ...                  XNA          NaN
9                       15  ...                  XNA          NaN

   NAME_YIELD_GROUP       PRODUCT_COMBINATION  DAYS_FIRST_DRAWING  \
0            middle  POS mobile with interest            365243.0
1        low_action          Cash X-Sell: low            365243.0
2              high         Cash X-Sell: high            365243.0
3            middle       Cash X-Sell: middle            365243.0
4              high         Cash Street: high                 NaN
5        low_normal          Cash X-Sell: low            365243.0
6               XNA                      Cash                 NaN
7               XNA                      Cash                 NaN
8               XNA                      Cash                 NaN
9               XNA                      Cash                 NaN

   DAYS_FIRST_DUE  DAYS_LAST_DUE_1ST_VERSION  DAYS_LAST_DUE  DAYS_TERMINATION  \
0           -42.0                      300.0          -42.0             -37.0
1          -134.0                      916.0       365243.0          365243.0
2          -271.0                       59.0       365243.0          365243.0
3          -482.0                     -152.0         -182.0            -177.0
4             NaN                        NaN            NaN               NaN
5          -654.0                     -144.0         -144.0            -137.0
6             NaN                        NaN            NaN               NaN
7             NaN                        NaN            NaN               NaN
8             NaN                        NaN            NaN               NaN
9             NaN                        NaN            NaN               NaN

   NFLAG_INSURED_ON_APPROVAL
0                        0.0
1                        1.0
2                        1.0
3                        1.0
4                        NaN
5                        1.0
6                        NaN
7                        NaN
8                        NaN
9                        NaN

[10 rows x 37 columns]
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
prevApps_feature_pipeline = Pipeline([
# ('prevApps_add_features1', prevApps_add_features1()), # add some new features
# ('prevApps_add_features2', prevApps_add_features2()), # add some new features
# ('prevApps_aggregater', prevAppsFeaturesAggregater()), # Aggregate across old and new features
('prevApps_aggregater', prevAppsFeaturesAggregater(features)), # Aggregate across old and new features
])
X_train= datasets["application_train"] #primary dataset
appsDF = datasets["previous_application"] #prev app
merge_all_data = False
# Transform all the secondary tables:
# 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
# 'previous_application', 'POS_CASH_balance'
if merge_all_data:
    prevApps_aggregated = prevApps_feature_pipeline.transform(appsDF)
bureau_aggregated = bureau_feature_pipeline.transform(bureau_DF)
bureau_bal_aggregated = bureau_bal_feature_pipeline.transform(bureau_bal)
cc_bal_aggregated = cc_bal_feature_pipeline.transform(cc_bal_DF)
install_pmt_aggregated = install_pmt_feature_pipeline.transform(install_pmt_DF)
POS_cash_bal_aggregated = POS_cash_bal_feature_pipeline.transform(POS_cash_DF)
X_kaggle_test= datasets["application_test"]
if merge_all_data:
# 1. Join/Merge in prevApps Data
X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
# 2. Join/Merge in bureau Data
X_kaggle_test = X_kaggle_test.merge(bureau_aggregated, how='left', on='SK_ID_CURR')
# 3. Join/Merge in bureau_balance Data
X_kaggle_test = X_kaggle_test.merge(bureau_bal_aggregated, how='left', on='SK_ID_CURR')
# 4. Join/Merge in credit_card_balance Data
X_kaggle_test = X_kaggle_test.merge(cc_bal_aggregated, how='left', on='SK_ID_CURR')
# 5. Join/Merge in installments_payments Data
X_kaggle_test = X_kaggle_test.merge(install_pmt_aggregated, how='left', on='SK_ID_CURR')
# 6. Join/Merge in POS_cash_balance Data
X_kaggle_test = X_kaggle_test.merge(POS_cash_bal_aggregated, how='left', on='SK_ID_CURR')
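Each of the merges above uses `how='left'`, which keeps every application row and fills NaN where a client has no rows in the aggregated table. A minimal sketch with hypothetical values (the column names mirror the real tables, the numbers do not):

```python
import pandas as pd

# Hypothetical applications and aggregated features:
# client 100002 had no previous applications, so no aggregate row exists
X = pd.DataFrame({"SK_ID_CURR": [100001, 100002],
                  "AMT_CREDIT": [406597.5, 1293502.5]})
agg = pd.DataFrame({"SK_ID_CURR": [100001],
                    "AMT_ANNUITY_mean": [9251.775]})

# how="left" keeps all rows of X; clients with no aggregates get NaN,
# which downstream imputation must handle before modeling.
merged = X.merge(agg, how="left", on="SK_ID_CURR")
print(merged["AMT_ANNUITY_mean"].isna().tolist())  # → [False, True]
```

An inner join here would silently drop applicants with no history, which is exactly the underserved population this competition targets, so the left join is the right choice.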
The target variable is highly imbalanced: most clients repaid their loans. This is exactly why we intend to use the F1 score, which handles class imbalance far better than plain accuracy, as a scoring metric.
import matplotlib.pyplot as plt
%matplotlib inline
train = datasets['application_train']
plt.figure(figsize=(10,6))
train["TARGET"].astype(int).plot.hist()
plt.title("Distribution of Target")
plt.xlabel("Target Class - 0: the loan was repaid / 1: the loan was not repaid")
Text(0.5, 0, 'Target Class - 0: the loan was repaid / 1: the loan was not repaid')
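To see why accuracy is misleading at this level of imbalance, consider a small sketch with scikit-learn on synthetic labels (an 8% positive rate, mimicking the roughly 8% default rate above; not competition data):

```python
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: 8 defaults out of 100, mirroring the class imbalance
y_true = [1] * 8 + [0] * 92
y_pred = [0] * 100  # a useless "everyone repays" classifier

# Accuracy looks excellent; F1 on the minority class exposes the failure
print(accuracy_score(y_true, y_pred))             # → 0.92
print(f1_score(y_true, y_pred, zero_division=0))  # → 0.0
```

A classifier that never predicts a default scores 92% accuracy but an F1 of zero, which is why we evaluate on F1 rather than accuracy.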
def print_info(df):
print ("INFO:")
print(datasets[df].info(verbose=True, null_counts=True))
print()
print("DATA DESCRIPTION: ")
print(datasets[df].describe())
for file_name in datasets.keys():
print(f"File: {file_name}".upper())
print("--------------------------")
print_info(file_name)
print()
print("*************************************************************************************")
print()
FILE: CREDIT_CARD_BALANCE
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_PREV 3840312 non-null int64
1 SK_ID_CURR 3840312 non-null int64
2 MONTHS_BALANCE 3840312 non-null int64
3 AMT_BALANCE 3840312 non-null float64
4 AMT_CREDIT_LIMIT_ACTUAL 3840312 non-null int64
5 AMT_DRAWINGS_ATM_CURRENT 3090496 non-null float64
6 AMT_DRAWINGS_CURRENT 3840312 non-null float64
7 AMT_DRAWINGS_OTHER_CURRENT 3090496 non-null float64
8 AMT_DRAWINGS_POS_CURRENT 3090496 non-null float64
9 AMT_INST_MIN_REGULARITY 3535076 non-null float64
10 AMT_PAYMENT_CURRENT 3072324 non-null float64
11 AMT_PAYMENT_TOTAL_CURRENT 3840312 non-null float64
12 AMT_RECEIVABLE_PRINCIPAL 3840312 non-null float64
13 AMT_RECIVABLE 3840312 non-null float64
14 AMT_TOTAL_RECEIVABLE 3840312 non-null float64
15 CNT_DRAWINGS_ATM_CURRENT 3090496 non-null float64
16 CNT_DRAWINGS_CURRENT 3840312 non-null int64
17 CNT_DRAWINGS_OTHER_CURRENT 3090496 non-null float64
18 CNT_DRAWINGS_POS_CURRENT 3090496 non-null float64
19 CNT_INSTALMENT_MATURE_CUM 3535076 non-null float64
20 NAME_CONTRACT_STATUS 3840312 non-null object
21 SK_DPD 3840312 non-null int64
22 SK_DPD_DEF 3840312 non-null int64
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
None
DATA DESCRIPTION:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE \
count 3.840312e+06 3.840312e+06 3.840312e+06 3.840312e+06
mean 1.904504e+06 2.783242e+05 -3.452192e+01 5.830016e+04
std 5.364695e+05 1.027045e+05 2.666775e+01 1.063070e+05
min 1.000018e+06 1.000060e+05 -9.600000e+01 -4.202502e+05
25% 1.434385e+06 1.895170e+05 -5.500000e+01 0.000000e+00
50% 1.897122e+06 2.783960e+05 -2.800000e+01 0.000000e+00
75% 2.369328e+06 3.675800e+05 -1.100000e+01 8.904669e+04
max 2.843496e+06 4.562500e+05 -1.000000e+00 1.505902e+06
AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT \
count 3.840312e+06 3.090496e+06
mean 1.538080e+05 5.961325e+03
std 1.651457e+05 2.822569e+04
min 0.000000e+00 -6.827310e+03
25% 4.500000e+04 0.000000e+00
50% 1.125000e+05 0.000000e+00
75% 1.800000e+05 0.000000e+00
max 1.350000e+06 2.115000e+06
AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT \
count 3.840312e+06 3.090496e+06
mean 7.433388e+03 2.881696e+02
std 3.384608e+04 8.201989e+03
min -6.211620e+03 0.000000e+00
25% 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00
75% 0.000000e+00 0.000000e+00
max 2.287098e+06 1.529847e+06
AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY ... \
count 3.090496e+06 3.535076e+06 ...
mean 2.968805e+03 3.540204e+03 ...
std 2.079689e+04 5.600154e+03 ...
min 0.000000e+00 0.000000e+00 ...
25% 0.000000e+00 0.000000e+00 ...
50% 0.000000e+00 0.000000e+00 ...
75% 0.000000e+00 6.633911e+03 ...
max 2.239274e+06 2.028820e+05 ...
AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE \
count 3.840312e+06 3.840312e+06 3.840312e+06
mean 5.596588e+04 5.808881e+04 5.809829e+04
std 1.025336e+05 1.059654e+05 1.059718e+05
min -4.233058e+05 -4.202502e+05 -4.202502e+05
25% 0.000000e+00 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00 0.000000e+00
75% 8.535924e+04 8.889949e+04 8.891451e+04
max 1.472317e+06 1.493338e+06 1.493338e+06
CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT \
count 3.090496e+06 3.840312e+06
mean 3.094490e-01 7.031439e-01
std 1.100401e+00 3.190347e+00
min 0.000000e+00 0.000000e+00
25% 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00
75% 0.000000e+00 0.000000e+00
max 5.100000e+01 1.650000e+02
CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT \
count 3.090496e+06 3.090496e+06
mean 4.812496e-03 5.594791e-01
std 8.263861e-02 3.240649e+00
min 0.000000e+00 0.000000e+00
25% 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00
75% 0.000000e+00 0.000000e+00
max 1.200000e+01 1.650000e+02
CNT_INSTALMENT_MATURE_CUM SK_DPD SK_DPD_DEF
count 3.535076e+06 3.840312e+06 3.840312e+06
mean 2.082508e+01 9.283667e+00 3.316220e-01
std 2.005149e+01 9.751570e+01 2.147923e+01
min 0.000000e+00 0.000000e+00 0.000000e+00
25% 4.000000e+00 0.000000e+00 0.000000e+00
50% 1.500000e+01 0.000000e+00 0.000000e+00
75% 3.200000e+01 0.000000e+00 0.000000e+00
max 1.200000e+02 3.260000e+03 3.260000e+03
[8 rows x 22 columns]
*************************************************************************************
FILE: INSTALLMENTS_PAYMENTS
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13605401 entries, 0 to 13605400
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_PREV 13605401 non-null int64
1 SK_ID_CURR 13605401 non-null int64
2 NUM_INSTALMENT_VERSION 13605401 non-null float64
3 NUM_INSTALMENT_NUMBER 13605401 non-null int64
4 DAYS_INSTALMENT 13605401 non-null float64
5 DAYS_ENTRY_PAYMENT 13602496 non-null float64
6 AMT_INSTALMENT 13605401 non-null float64
7 AMT_PAYMENT 13602496 non-null float64
dtypes: float64(5), int64(3)
memory usage: 830.4 MB
None
DATA DESCRIPTION:
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION \
count 1.360540e+07 1.360540e+07 1.360540e+07
mean 1.903365e+06 2.784449e+05 8.566373e-01
std 5.362029e+05 1.027183e+05 1.035216e+00
min 1.000001e+06 1.000010e+05 0.000000e+00
25% 1.434191e+06 1.896390e+05 0.000000e+00
50% 1.896520e+06 2.786850e+05 1.000000e+00
75% 2.369094e+06 3.675300e+05 1.000000e+00
max 2.843499e+06 4.562550e+05 1.780000e+02
NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT \
count 1.360540e+07 1.360540e+07 1.360250e+07
mean 1.887090e+01 -1.042270e+03 -1.051114e+03
std 2.666407e+01 8.009463e+02 8.005859e+02
min 1.000000e+00 -2.922000e+03 -4.921000e+03
25% 4.000000e+00 -1.654000e+03 -1.662000e+03
50% 8.000000e+00 -8.180000e+02 -8.270000e+02
75% 1.900000e+01 -3.610000e+02 -3.700000e+02
max 2.770000e+02 -1.000000e+00 -1.000000e+00
AMT_INSTALMENT AMT_PAYMENT
count 1.360540e+07 1.360250e+07
mean 1.705091e+04 1.723822e+04
std 5.057025e+04 5.473578e+04
min 0.000000e+00 0.000000e+00
25% 4.226085e+03 3.398265e+03
50% 8.884080e+03 8.125515e+03
75% 1.671021e+04 1.610842e+04
max 3.771488e+06 3.771488e+06
*************************************************************************************
FILE: BUREAU_BALANCE
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_BUREAU 27299925 non-null int64
1 MONTHS_BALANCE 27299925 non-null int64
2 STATUS 27299925 non-null object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None
DATA DESCRIPTION:
SK_ID_BUREAU MONTHS_BALANCE
count 2.729992e+07 2.729992e+07
mean 6.036297e+06 -3.074169e+01
std 4.923489e+05 2.386451e+01
min 5.001709e+06 -9.600000e+01
25% 5.730933e+06 -4.600000e+01
50% 6.070821e+06 -2.500000e+01
75% 6.431951e+06 -1.100000e+01
max 6.842888e+06 0.000000e+00
*************************************************************************************
FILE: APPLICATION_TRAIN
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 122 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_CURR 307511 non-null int64
1 TARGET 307511 non-null int64
2 NAME_CONTRACT_TYPE 307511 non-null object
3 CODE_GENDER 307511 non-null object
4 FLAG_OWN_CAR 307511 non-null object
5 FLAG_OWN_REALTY 307511 non-null object
6 CNT_CHILDREN 307511 non-null int64
7 AMT_INCOME_TOTAL 307511 non-null float64
8 AMT_CREDIT 307511 non-null float64
9 AMT_ANNUITY 307499 non-null float64
10 AMT_GOODS_PRICE 307233 non-null float64
11 NAME_TYPE_SUITE 306219 non-null object
12 NAME_INCOME_TYPE 307511 non-null object
13 NAME_EDUCATION_TYPE 307511 non-null object
14 NAME_FAMILY_STATUS 307511 non-null object
15 NAME_HOUSING_TYPE 307511 non-null object
16 REGION_POPULATION_RELATIVE 307511 non-null float64
17 DAYS_BIRTH 307511 non-null int64
18 DAYS_EMPLOYED 307511 non-null int64
19 DAYS_REGISTRATION 307511 non-null float64
20 DAYS_ID_PUBLISH 307511 non-null int64
21 OWN_CAR_AGE 104582 non-null float64
22 FLAG_MOBIL 307511 non-null int64
23 FLAG_EMP_PHONE 307511 non-null int64
24 FLAG_WORK_PHONE 307511 non-null int64
25 FLAG_CONT_MOBILE 307511 non-null int64
26 FLAG_PHONE 307511 non-null int64
27 FLAG_EMAIL 307511 non-null int64
28 OCCUPATION_TYPE 211120 non-null object
29 CNT_FAM_MEMBERS 307509 non-null float64
30 REGION_RATING_CLIENT 307511 non-null int64
31 REGION_RATING_CLIENT_W_CITY 307511 non-null int64
32 WEEKDAY_APPR_PROCESS_START 307511 non-null object
33 HOUR_APPR_PROCESS_START 307511 non-null int64
34 REG_REGION_NOT_LIVE_REGION 307511 non-null int64
35 REG_REGION_NOT_WORK_REGION 307511 non-null int64
36 LIVE_REGION_NOT_WORK_REGION 307511 non-null int64
37 REG_CITY_NOT_LIVE_CITY 307511 non-null int64
38 REG_CITY_NOT_WORK_CITY 307511 non-null int64
39 LIVE_CITY_NOT_WORK_CITY 307511 non-null int64
40 ORGANIZATION_TYPE 307511 non-null object
41 EXT_SOURCE_1 134133 non-null float64
42 EXT_SOURCE_2 306851 non-null float64
43 EXT_SOURCE_3 246546 non-null float64
44 APARTMENTS_AVG 151450 non-null float64
45 BASEMENTAREA_AVG 127568 non-null float64
46 YEARS_BEGINEXPLUATATION_AVG 157504 non-null float64
47 YEARS_BUILD_AVG 103023 non-null float64
48 COMMONAREA_AVG 92646 non-null float64
49 ELEVATORS_AVG 143620 non-null float64
50 ENTRANCES_AVG 152683 non-null float64
51 FLOORSMAX_AVG 154491 non-null float64
52 FLOORSMIN_AVG 98869 non-null float64
53 LANDAREA_AVG 124921 non-null float64
54 LIVINGAPARTMENTS_AVG 97312 non-null float64
55 LIVINGAREA_AVG 153161 non-null float64
56 NONLIVINGAPARTMENTS_AVG 93997 non-null float64
57 NONLIVINGAREA_AVG 137829 non-null float64
58 APARTMENTS_MODE 151450 non-null float64
59 BASEMENTAREA_MODE 127568 non-null float64
60 YEARS_BEGINEXPLUATATION_MODE 157504 non-null float64
61 YEARS_BUILD_MODE 103023 non-null float64
62 COMMONAREA_MODE 92646 non-null float64
63 ELEVATORS_MODE 143620 non-null float64
64 ENTRANCES_MODE 152683 non-null float64
65 FLOORSMAX_MODE 154491 non-null float64
66 FLOORSMIN_MODE 98869 non-null float64
67 LANDAREA_MODE 124921 non-null float64
68 LIVINGAPARTMENTS_MODE 97312 non-null float64
69 LIVINGAREA_MODE 153161 non-null float64
70 NONLIVINGAPARTMENTS_MODE 93997 non-null float64
71 NONLIVINGAREA_MODE 137829 non-null float64
72 APARTMENTS_MEDI 151450 non-null float64
73 BASEMENTAREA_MEDI 127568 non-null float64
74 YEARS_BEGINEXPLUATATION_MEDI 157504 non-null float64
75 YEARS_BUILD_MEDI 103023 non-null float64
76 COMMONAREA_MEDI 92646 non-null float64
77 ELEVATORS_MEDI 143620 non-null float64
78 ENTRANCES_MEDI 152683 non-null float64
79 FLOORSMAX_MEDI 154491 non-null float64
80 FLOORSMIN_MEDI 98869 non-null float64
81 LANDAREA_MEDI 124921 non-null float64
82 LIVINGAPARTMENTS_MEDI 97312 non-null float64
83 LIVINGAREA_MEDI 153161 non-null float64
84 NONLIVINGAPARTMENTS_MEDI 93997 non-null float64
85 NONLIVINGAREA_MEDI 137829 non-null float64
86 FONDKAPREMONT_MODE 97216 non-null object
87 HOUSETYPE_MODE 153214 non-null object
88 TOTALAREA_MODE 159080 non-null float64
89 WALLSMATERIAL_MODE 151170 non-null object
90 EMERGENCYSTATE_MODE 161756 non-null object
91 OBS_30_CNT_SOCIAL_CIRCLE 306490 non-null float64
92 DEF_30_CNT_SOCIAL_CIRCLE 306490 non-null float64
93 OBS_60_CNT_SOCIAL_CIRCLE 306490 non-null float64
94 DEF_60_CNT_SOCIAL_CIRCLE 306490 non-null float64
95 DAYS_LAST_PHONE_CHANGE 307510 non-null float64
96 FLAG_DOCUMENT_2 307511 non-null int64
97 FLAG_DOCUMENT_3 307511 non-null int64
98 FLAG_DOCUMENT_4 307511 non-null int64
99 FLAG_DOCUMENT_5 307511 non-null int64
100 FLAG_DOCUMENT_6 307511 non-null int64
101 FLAG_DOCUMENT_7 307511 non-null int64
102 FLAG_DOCUMENT_8 307511 non-null int64
103 FLAG_DOCUMENT_9 307511 non-null int64
104 FLAG_DOCUMENT_10 307511 non-null int64
105 FLAG_DOCUMENT_11 307511 non-null int64
106 FLAG_DOCUMENT_12 307511 non-null int64
107 FLAG_DOCUMENT_13 307511 non-null int64
108 FLAG_DOCUMENT_14 307511 non-null int64
109 FLAG_DOCUMENT_15 307511 non-null int64
110 FLAG_DOCUMENT_16 307511 non-null int64
111 FLAG_DOCUMENT_17 307511 non-null int64
112 FLAG_DOCUMENT_18 307511 non-null int64
113 FLAG_DOCUMENT_19 307511 non-null int64
114 FLAG_DOCUMENT_20 307511 non-null int64
115 FLAG_DOCUMENT_21 307511 non-null int64
116 AMT_REQ_CREDIT_BUREAU_HOUR 265992 non-null float64
117 AMT_REQ_CREDIT_BUREAU_DAY 265992 non-null float64
118 AMT_REQ_CREDIT_BUREAU_WEEK 265992 non-null float64
119 AMT_REQ_CREDIT_BUREAU_MON 265992 non-null float64
120 AMT_REQ_CREDIT_BUREAU_QRT 265992 non-null float64
121 AMT_REQ_CREDIT_BUREAU_YEAR 265992 non-null float64
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
DATA DESCRIPTION:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL \
count 307511.000000 307511.000000 307511.000000 3.075110e+05
mean 278180.518577 0.080729 0.417052 1.687979e+05
std 102790.175348 0.272419 0.722121 2.371231e+05
min 100002.000000 0.000000 0.000000 2.565000e+04
25% 189145.500000 0.000000 0.000000 1.125000e+05
50% 278202.000000 0.000000 0.000000 1.471500e+05
75% 367142.500000 0.000000 1.000000 2.025000e+05
max 456255.000000 1.000000 19.000000 1.170000e+08
AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE \
count 3.075110e+05 307499.000000 3.072330e+05
mean 5.990260e+05 27108.573909 5.383962e+05
std 4.024908e+05 14493.737315 3.694465e+05
min 4.500000e+04 1615.500000 4.050000e+04
25% 2.700000e+05 16524.000000 2.385000e+05
50% 5.135310e+05 24903.000000 4.500000e+05
75% 8.086500e+05 34596.000000 6.795000e+05
max 4.050000e+06 258025.500000 4.050000e+06
REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED ... \
count 307511.000000 307511.000000 307511.000000 ...
mean 0.020868 -16036.995067 63815.045904 ...
std 0.013831 4363.988632 141275.766519 ...
min 0.000290 -25229.000000 -17912.000000 ...
25% 0.010006 -19682.000000 -2760.000000 ...
50% 0.018850 -15750.000000 -1213.000000 ...
75% 0.028663 -12413.000000 -289.000000 ...
max 0.072508 -7489.000000 365243.000000 ...
FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 \
count 307511.000000 307511.000000 307511.000000 307511.000000
mean 0.008130 0.000595 0.000507 0.000335
std 0.089798 0.024387 0.022518 0.018299
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 1.000000 1.000000
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY \
count 265992.000000 265992.000000
mean 0.006402 0.007000
std 0.083849 0.110757
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 4.000000 9.000000
AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON \
count 265992.000000 265992.000000
mean 0.034362 0.267395
std 0.204685 0.916002
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 8.000000 27.000000
AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 265992.000000 265992.000000
mean 0.265474 1.899974
std 0.794056 1.869295
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 1.000000
75% 0.000000 3.000000
max 261.000000 25.000000
[8 rows x 106 columns]
*************************************************************************************
FILE: POS_CASH_BALANCE
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_PREV 10001358 non-null int64
1 SK_ID_CURR 10001358 non-null int64
2 MONTHS_BALANCE 10001358 non-null int64
3 CNT_INSTALMENT 9975287 non-null float64
4 CNT_INSTALMENT_FUTURE 9975271 non-null float64
5 NAME_CONTRACT_STATUS 10001358 non-null object
6 SK_DPD 10001358 non-null int64
7 SK_DPD_DEF 10001358 non-null int64
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None
DATA DESCRIPTION:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT \
count 1.000136e+07 1.000136e+07 1.000136e+07 9.975287e+06
mean 1.903217e+06 2.784039e+05 -3.501259e+01 1.708965e+01
std 5.358465e+05 1.027637e+05 2.606657e+01 1.199506e+01
min 1.000001e+06 1.000010e+05 -9.600000e+01 1.000000e+00
25% 1.434405e+06 1.895500e+05 -5.400000e+01 1.000000e+01
50% 1.896565e+06 2.786540e+05 -2.800000e+01 1.200000e+01
75% 2.368963e+06 3.674290e+05 -1.300000e+01 2.400000e+01
max 2.843499e+06 4.562550e+05 -1.000000e+00 9.200000e+01
CNT_INSTALMENT_FUTURE SK_DPD SK_DPD_DEF
count 9.975271e+06 1.000136e+07 1.000136e+07
mean 1.048384e+01 1.160693e+01 6.544684e-01
std 1.110906e+01 1.327140e+02 3.276249e+01
min 0.000000e+00 0.000000e+00 0.000000e+00
25% 3.000000e+00 0.000000e+00 0.000000e+00
50% 7.000000e+00 0.000000e+00 0.000000e+00
75% 1.400000e+01 0.000000e+00 0.000000e+00
max 8.500000e+01 4.231000e+03 3.595000e+03
*************************************************************************************
FILE: APPLICATION_TEST
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Data columns (total 121 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_CURR 48744 non-null int64
1 NAME_CONTRACT_TYPE 48744 non-null object
2 CODE_GENDER 48744 non-null object
3 FLAG_OWN_CAR 48744 non-null object
4 FLAG_OWN_REALTY 48744 non-null object
5 CNT_CHILDREN 48744 non-null int64
6 AMT_INCOME_TOTAL 48744 non-null float64
7 AMT_CREDIT 48744 non-null float64
8 AMT_ANNUITY 48720 non-null float64
9 AMT_GOODS_PRICE 48744 non-null float64
10 NAME_TYPE_SUITE 47833 non-null object
11 NAME_INCOME_TYPE 48744 non-null object
12 NAME_EDUCATION_TYPE 48744 non-null object
13 NAME_FAMILY_STATUS 48744 non-null object
14 NAME_HOUSING_TYPE 48744 non-null object
15 REGION_POPULATION_RELATIVE 48744 non-null float64
16 DAYS_BIRTH 48744 non-null int64
17 DAYS_EMPLOYED 48744 non-null int64
18 DAYS_REGISTRATION 48744 non-null float64
19 DAYS_ID_PUBLISH 48744 non-null int64
20 OWN_CAR_AGE 16432 non-null float64
21 FLAG_MOBIL 48744 non-null int64
22 FLAG_EMP_PHONE 48744 non-null int64
23 FLAG_WORK_PHONE 48744 non-null int64
24 FLAG_CONT_MOBILE 48744 non-null int64
25 FLAG_PHONE 48744 non-null int64
26 FLAG_EMAIL 48744 non-null int64
27 OCCUPATION_TYPE 33139 non-null object
28 CNT_FAM_MEMBERS 48744 non-null float64
29 REGION_RATING_CLIENT 48744 non-null int64
30 REGION_RATING_CLIENT_W_CITY 48744 non-null int64
31 WEEKDAY_APPR_PROCESS_START 48744 non-null object
32 HOUR_APPR_PROCESS_START 48744 non-null int64
33 REG_REGION_NOT_LIVE_REGION 48744 non-null int64
34 REG_REGION_NOT_WORK_REGION 48744 non-null int64
35 LIVE_REGION_NOT_WORK_REGION 48744 non-null int64
36 REG_CITY_NOT_LIVE_CITY 48744 non-null int64
37 REG_CITY_NOT_WORK_CITY 48744 non-null int64
38 LIVE_CITY_NOT_WORK_CITY 48744 non-null int64
39 ORGANIZATION_TYPE 48744 non-null object
40 EXT_SOURCE_1 28212 non-null float64
41 EXT_SOURCE_2 48736 non-null float64
42 EXT_SOURCE_3 40076 non-null float64
43 APARTMENTS_AVG 24857 non-null float64
44 BASEMENTAREA_AVG 21103 non-null float64
45 YEARS_BEGINEXPLUATATION_AVG 25888 non-null float64
46 YEARS_BUILD_AVG 16926 non-null float64
47 COMMONAREA_AVG 15249 non-null float64
48 ELEVATORS_AVG 23555 non-null float64
49 ENTRANCES_AVG 25165 non-null float64
50 FLOORSMAX_AVG 25423 non-null float64
51 FLOORSMIN_AVG 16278 non-null float64
52 LANDAREA_AVG 20490 non-null float64
53 LIVINGAPARTMENTS_AVG 15964 non-null float64
54 LIVINGAREA_AVG 25192 non-null float64
55 NONLIVINGAPARTMENTS_AVG 15397 non-null float64
56 NONLIVINGAREA_AVG 22660 non-null float64
57 APARTMENTS_MODE 24857 non-null float64
58 BASEMENTAREA_MODE 21103 non-null float64
59 YEARS_BEGINEXPLUATATION_MODE 25888 non-null float64
60 YEARS_BUILD_MODE 16926 non-null float64
61 COMMONAREA_MODE 15249 non-null float64
62 ELEVATORS_MODE 23555 non-null float64
63 ENTRANCES_MODE 25165 non-null float64
64 FLOORSMAX_MODE 25423 non-null float64
65 FLOORSMIN_MODE 16278 non-null float64
66 LANDAREA_MODE 20490 non-null float64
67 LIVINGAPARTMENTS_MODE 15964 non-null float64
68 LIVINGAREA_MODE 25192 non-null float64
69 NONLIVINGAPARTMENTS_MODE 15397 non-null float64
70 NONLIVINGAREA_MODE 22660 non-null float64
71 APARTMENTS_MEDI 24857 non-null float64
72 BASEMENTAREA_MEDI 21103 non-null float64
73 YEARS_BEGINEXPLUATATION_MEDI 25888 non-null float64
74 YEARS_BUILD_MEDI 16926 non-null float64
75 COMMONAREA_MEDI 15249 non-null float64
76 ELEVATORS_MEDI 23555 non-null float64
77 ENTRANCES_MEDI 25165 non-null float64
78 FLOORSMAX_MEDI 25423 non-null float64
79 FLOORSMIN_MEDI 16278 non-null float64
80 LANDAREA_MEDI 20490 non-null float64
81 LIVINGAPARTMENTS_MEDI 15964 non-null float64
82 LIVINGAREA_MEDI 25192 non-null float64
83 NONLIVINGAPARTMENTS_MEDI 15397 non-null float64
84 NONLIVINGAREA_MEDI 22660 non-null float64
85 FONDKAPREMONT_MODE 15947 non-null object
86 HOUSETYPE_MODE 25125 non-null object
87 TOTALAREA_MODE 26120 non-null float64
88 WALLSMATERIAL_MODE 24851 non-null object
89 EMERGENCYSTATE_MODE 26535 non-null object
90 OBS_30_CNT_SOCIAL_CIRCLE 48715 non-null float64
91 DEF_30_CNT_SOCIAL_CIRCLE 48715 non-null float64
92 OBS_60_CNT_SOCIAL_CIRCLE 48715 non-null float64
93 DEF_60_CNT_SOCIAL_CIRCLE 48715 non-null float64
94 DAYS_LAST_PHONE_CHANGE 48744 non-null float64
95 FLAG_DOCUMENT_2 48744 non-null int64
96 FLAG_DOCUMENT_3 48744 non-null int64
97 FLAG_DOCUMENT_4 48744 non-null int64
98 FLAG_DOCUMENT_5 48744 non-null int64
99 FLAG_DOCUMENT_6 48744 non-null int64
100 FLAG_DOCUMENT_7 48744 non-null int64
101 FLAG_DOCUMENT_8 48744 non-null int64
102 FLAG_DOCUMENT_9 48744 non-null int64
103 FLAG_DOCUMENT_10 48744 non-null int64
104 FLAG_DOCUMENT_11 48744 non-null int64
105 FLAG_DOCUMENT_12 48744 non-null int64
106 FLAG_DOCUMENT_13 48744 non-null int64
107 FLAG_DOCUMENT_14 48744 non-null int64
108 FLAG_DOCUMENT_15 48744 non-null int64
109 FLAG_DOCUMENT_16 48744 non-null int64
110 FLAG_DOCUMENT_17 48744 non-null int64
111 FLAG_DOCUMENT_18 48744 non-null int64
112 FLAG_DOCUMENT_19 48744 non-null int64
113 FLAG_DOCUMENT_20 48744 non-null int64
114 FLAG_DOCUMENT_21 48744 non-null int64
115 AMT_REQ_CREDIT_BUREAU_HOUR 42695 non-null float64
116 AMT_REQ_CREDIT_BUREAU_DAY 42695 non-null float64
117 AMT_REQ_CREDIT_BUREAU_WEEK 42695 non-null float64
118 AMT_REQ_CREDIT_BUREAU_MON 42695 non-null float64
119 AMT_REQ_CREDIT_BUREAU_QRT 42695 non-null float64
120 AMT_REQ_CREDIT_BUREAU_YEAR 42695 non-null float64
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None
DATA DESCRIPTION:
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT \
count 48744.000000 48744.000000 4.874400e+04 4.874400e+04
mean 277796.676350 0.397054 1.784318e+05 5.167404e+05
std 103169.547296 0.709047 1.015226e+05 3.653970e+05
min 100001.000000 0.000000 2.694150e+04 4.500000e+04
25% 188557.750000 0.000000 1.125000e+05 2.606400e+05
50% 277549.000000 0.000000 1.575000e+05 4.500000e+05
75% 367555.500000 1.000000 2.250000e+05 6.750000e+05
max 456250.000000 20.000000 4.410000e+06 2.245500e+06
AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE \
count 48720.000000 4.874400e+04 48744.000000
mean 29426.240209 4.626188e+05 0.021226
std 16016.368315 3.367102e+05 0.014428
min 2295.000000 4.500000e+04 0.000253
25% 17973.000000 2.250000e+05 0.010006
50% 26199.000000 3.960000e+05 0.018850
75% 37390.500000 6.300000e+05 0.028663
max 180576.000000 2.245500e+06 0.072508
DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION ... FLAG_DOCUMENT_18 \
count 48744.000000 48744.000000 48744.000000 ... 48744.000000
mean -16068.084605 67485.366322 -4967.652716 ... 0.001559
std 4325.900393 144348.507136 3552.612035 ... 0.039456
min -25195.000000 -17463.000000 -23722.000000 ... 0.000000
25% -19637.000000 -2910.000000 -7459.250000 ... 0.000000
50% -15785.000000 -1293.000000 -4490.000000 ... 0.000000
75% -12496.000000 -296.000000 -1901.000000 ... 0.000000
max -7338.000000 365243.000000 0.000000 ... 1.000000
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 \
count 48744.0 48744.0 48744.0
mean 0.0 0.0 0.0
std 0.0 0.0 0.0
min 0.0 0.0 0.0
25% 0.0 0.0 0.0
50% 0.0 0.0 0.0
75% 0.0 0.0 0.0
max 0.0 0.0 0.0
AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY \
count 42695.000000 42695.000000
mean 0.002108 0.001803
std 0.046373 0.046132
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 2.000000 2.000000
AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON \
count 42695.000000 42695.000000
mean 0.002787 0.009299
std 0.054037 0.110924
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 2.000000 6.000000
AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 42695.000000 42695.000000
mean 0.546902 1.983769
std 0.693305 1.838873
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 2.000000
75% 1.000000 3.000000
max 7.000000 17.000000
[8 rows x 105 columns]
*************************************************************************************
FILE: BUREAU
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_CURR 1716428 non-null int64
1 SK_ID_BUREAU 1716428 non-null int64
2 CREDIT_ACTIVE 1716428 non-null object
3 CREDIT_CURRENCY 1716428 non-null object
4 DAYS_CREDIT 1716428 non-null int64
5 CREDIT_DAY_OVERDUE 1716428 non-null int64
6 DAYS_CREDIT_ENDDATE 1610875 non-null float64
7 DAYS_ENDDATE_FACT 1082775 non-null float64
8 AMT_CREDIT_MAX_OVERDUE 591940 non-null float64
9 CNT_CREDIT_PROLONG 1716428 non-null int64
10 AMT_CREDIT_SUM 1716415 non-null float64
11 AMT_CREDIT_SUM_DEBT 1458759 non-null float64
12 AMT_CREDIT_SUM_LIMIT 1124648 non-null float64
13 AMT_CREDIT_SUM_OVERDUE 1716428 non-null float64
14 CREDIT_TYPE 1716428 non-null object
15 DAYS_CREDIT_UPDATE 1716428 non-null int64
16 AMT_ANNUITY 489637 non-null float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None
DATA DESCRIPTION:
SK_ID_CURR SK_ID_BUREAU DAYS_CREDIT CREDIT_DAY_OVERDUE \
count 1.716428e+06 1.716428e+06 1.716428e+06 1.716428e+06
mean 2.782149e+05 5.924434e+06 -1.142108e+03 8.181666e-01
std 1.029386e+05 5.322657e+05 7.951649e+02 3.654443e+01
min 1.000010e+05 5.000000e+06 -2.922000e+03 0.000000e+00
25% 1.888668e+05 5.463954e+06 -1.666000e+03 0.000000e+00
50% 2.780550e+05 5.926304e+06 -9.870000e+02 0.000000e+00
75% 3.674260e+05 6.385681e+06 -4.740000e+02 0.000000e+00
max 4.562550e+05 6.843457e+06 0.000000e+00 2.792000e+03
DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE \
count 1.610875e+06 1.082775e+06 5.919400e+05
mean 5.105174e+02 -1.017437e+03 3.825418e+03
std 4.994220e+03 7.140106e+02 2.060316e+05
min -4.206000e+04 -4.202300e+04 0.000000e+00
25% -1.138000e+03 -1.489000e+03 0.000000e+00
50% -3.300000e+02 -8.970000e+02 0.000000e+00
75% 4.740000e+02 -4.250000e+02 0.000000e+00
max 3.119900e+04 0.000000e+00 1.159872e+08
CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT \
count 1.716428e+06 1.716415e+06 1.458759e+06
mean 6.410406e-03 3.549946e+05 1.370851e+05
std 9.622391e-02 1.149811e+06 6.774011e+05
min 0.000000e+00 0.000000e+00 -4.705600e+06
25% 0.000000e+00 5.130000e+04 0.000000e+00
50% 0.000000e+00 1.255185e+05 0.000000e+00
75% 0.000000e+00 3.150000e+05 4.015350e+04
max 9.000000e+00 5.850000e+08 1.701000e+08
AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE DAYS_CREDIT_UPDATE \
count 1.124648e+06 1.716428e+06 1.716428e+06
mean 6.229515e+03 3.791276e+01 -5.937483e+02
std 4.503203e+04 5.937650e+03 7.207473e+02
min -5.864061e+05 0.000000e+00 -4.194700e+04
25% 0.000000e+00 0.000000e+00 -9.080000e+02
50% 0.000000e+00 0.000000e+00 -3.950000e+02
75% 0.000000e+00 0.000000e+00 -3.300000e+01
max 4.705600e+06 3.756681e+06 3.720000e+02
AMT_ANNUITY
count 4.896370e+05
mean 1.571276e+04
std 3.258269e+05
min 0.000000e+00
25% 0.000000e+00
50% 0.000000e+00
75% 1.350000e+04
max 1.184534e+08
*************************************************************************************
FILE: PREVIOUS_APPLICATION
--------------------------
INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_PREV 1670214 non-null int64
1 SK_ID_CURR 1670214 non-null int64
2 NAME_CONTRACT_TYPE 1670214 non-null object
3 AMT_ANNUITY 1297979 non-null float64
4 AMT_APPLICATION 1670214 non-null float64
5 AMT_CREDIT 1670213 non-null float64
6 AMT_DOWN_PAYMENT 774370 non-null float64
7 AMT_GOODS_PRICE 1284699 non-null float64
8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object
9 HOUR_APPR_PROCESS_START 1670214 non-null int64
10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object
11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64
12 RATE_DOWN_PAYMENT 774370 non-null float64
13 RATE_INTEREST_PRIMARY 5951 non-null float64
14 RATE_INTEREST_PRIVILEGED 5951 non-null float64
15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object
16 NAME_CONTRACT_STATUS 1670214 non-null object
17 DAYS_DECISION 1670214 non-null int64
18 NAME_PAYMENT_TYPE 1670214 non-null object
19 CODE_REJECT_REASON 1670214 non-null object
20 NAME_TYPE_SUITE 849809 non-null object
21 NAME_CLIENT_TYPE 1670214 non-null object
22 NAME_GOODS_CATEGORY 1670214 non-null object
23 NAME_PORTFOLIO 1670214 non-null object
24 NAME_PRODUCT_TYPE 1670214 non-null object
25 CHANNEL_TYPE 1670214 non-null object
26 SELLERPLACE_AREA 1670214 non-null int64
27 NAME_SELLER_INDUSTRY 1670214 non-null object
28 CNT_PAYMENT 1297984 non-null float64
29 NAME_YIELD_GROUP 1670214 non-null object
30 PRODUCT_COMBINATION 1669868 non-null object
31 DAYS_FIRST_DRAWING 997149 non-null float64
32 DAYS_FIRST_DUE 997149 non-null float64
33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64
34 DAYS_LAST_DUE 997149 non-null float64
35 DAYS_TERMINATION 997149 non-null float64
36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
None
DATA DESCRIPTION:
SK_ID_PREV SK_ID_CURR AMT_ANNUITY AMT_APPLICATION \
count 1.670214e+06 1.670214e+06 1.297979e+06 1.670214e+06
mean 1.923089e+06 2.783572e+05 1.595512e+04 1.752339e+05
std 5.325980e+05 1.028148e+05 1.478214e+04 2.927798e+05
min 1.000001e+06 1.000010e+05 0.000000e+00 0.000000e+00
25% 1.461857e+06 1.893290e+05 6.321780e+03 1.872000e+04
50% 1.923110e+06 2.787145e+05 1.125000e+04 7.104600e+04
75% 2.384280e+06 3.675140e+05 2.065842e+04 1.803600e+05
max 2.845382e+06 4.562550e+05 4.180581e+05 6.905160e+06
AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE \
count 1.670213e+06 7.743700e+05 1.284699e+06
mean 1.961140e+05 6.697402e+03 2.278473e+05
std 3.185746e+05 2.092150e+04 3.153966e+05
min 0.000000e+00 -9.000000e-01 0.000000e+00
25% 2.416050e+04 0.000000e+00 5.084100e+04
50% 8.054100e+04 1.638000e+03 1.123200e+05
75% 2.164185e+05 7.740000e+03 2.340000e+05
max 6.905160e+06 3.060045e+06 6.905160e+06
HOUR_APPR_PROCESS_START NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT \
count 1.670214e+06 1.670214e+06 774370.000000
mean 1.248418e+01 9.964675e-01 0.079637
std 3.334028e+00 5.932963e-02 0.107823
min 0.000000e+00 0.000000e+00 -0.000015
25% 1.000000e+01 1.000000e+00 0.000000
50% 1.200000e+01 1.000000e+00 0.051605
75% 1.500000e+01 1.000000e+00 0.108909
max 2.300000e+01 1.000000e+00 1.000000
... RATE_INTEREST_PRIVILEGED DAYS_DECISION SELLERPLACE_AREA \
count ... 5951.000000 1.670214e+06 1.670214e+06
mean ... 0.773503 -8.806797e+02 3.139511e+02
std ... 0.100879 7.790997e+02 7.127443e+03
min ... 0.373150 -2.922000e+03 -1.000000e+00
25% ... 0.715645 -1.300000e+03 -1.000000e+00
50% ... 0.835095 -5.810000e+02 3.000000e+00
75% ... 0.852537 -2.800000e+02 8.200000e+01
max ... 1.000000 -1.000000e+00 4.000000e+06
CNT_PAYMENT DAYS_FIRST_DRAWING DAYS_FIRST_DUE \
count 1.297984e+06 997149.000000 997149.000000
mean 1.605408e+01 342209.855039 13826.269337
std 1.456729e+01 88916.115833 72444.869708
min 0.000000e+00 -2922.000000 -2892.000000
25% 6.000000e+00 365243.000000 -1628.000000
50% 1.200000e+01 365243.000000 -831.000000
75% 2.400000e+01 365243.000000 -411.000000
max 8.400000e+01 365243.000000 365243.000000
DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION \
count 997149.000000 997149.000000 997149.000000
mean 33767.774054 76582.403064 81992.343838
std 106857.034789 149647.415123 153303.516729
min -2801.000000 -2889.000000 -2874.000000
25% -1242.000000 -1314.000000 -1270.000000
50% -361.000000 -537.000000 -499.000000
75% 129.000000 -74.000000 -44.000000
max 365243.000000 365243.000000 365243.000000
NFLAG_INSURED_ON_APPROVAL
count 997149.000000
mean 0.332570
std 0.471134
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 1.000000
[8 rows x 21 columns]
*************************************************************************************
#application_train
train = datasets['application_train']
corrs = pd.DataFrame(train.corr()['TARGET']).rename(columns={"TARGET":"cor"})
corrs["abs_corr"] = corrs["cor"].abs()
corrs = corrs.sort_values("cor")
print(corrs)
                                  cor  abs_corr
EXT_SOURCE_3                -0.178919  0.178919
EXT_SOURCE_2                -0.160472  0.160472
EXT_SOURCE_1                -0.155317  0.155317
DAYS_EMPLOYED               -0.044932  0.044932
FLOORSMAX_AVG               -0.044003  0.044003
...                               ...       ...
DAYS_LAST_PHONE_CHANGE       0.055218  0.055218
REGION_RATING_CLIENT         0.058899  0.058899
REGION_RATING_CLIENT_W_CITY  0.060893  0.060893
DAYS_BIRTH                   0.078239  0.078239
TARGET                       1.000000  1.000000

[106 rows x 2 columns]
# Top Correlated Features
print("10 Most positive correlations to Target:")
print("-------------------------------------------------------")
print(corrs["cor"].tail(10))
print()
print("\n10 Most negative correlations to Target:")
print("-------------------------------------------------------")
print(corrs["cor"].head(10))
print()
print("\n10 Most correlated to Target (by absolute value):")
print("-------------------------------------------------------")
top_10_corrs = corrs.sort_values("abs_corr", ascending=False).head(11)  # 11 rows so TARGET itself (corr = 1.0) is included with the top 10
print(top_10_corrs)
10 Most positive correlations to Target:
-------------------------------------------------------
FLAG_DOCUMENT_3 0.044346
REG_CITY_NOT_LIVE_CITY 0.044395
FLAG_EMP_PHONE 0.045982
REG_CITY_NOT_WORK_CITY 0.050994
DAYS_ID_PUBLISH 0.051457
DAYS_LAST_PHONE_CHANGE 0.055218
REGION_RATING_CLIENT 0.058899
REGION_RATING_CLIENT_W_CITY 0.060893
DAYS_BIRTH 0.078239
TARGET 1.000000
Name: cor, dtype: float64
10 Most negative correlations to Target:
-------------------------------------------------------
EXT_SOURCE_3 -0.178919
EXT_SOURCE_2 -0.160472
EXT_SOURCE_1 -0.155317
DAYS_EMPLOYED -0.044932
FLOORSMAX_AVG -0.044003
FLOORSMAX_MEDI -0.043768
FLOORSMAX_MODE -0.043226
AMT_GOODS_PRICE -0.039645
REGION_POPULATION_RELATIVE -0.037227
ELEVATORS_AVG -0.034199
Name: cor, dtype: float64
10 Most correlated to Target (by absolute value):
-------------------------------------------------------
cor abs_corr
TARGET 1.000000 1.000000
EXT_SOURCE_3 -0.178919 0.178919
EXT_SOURCE_2 -0.160472 0.160472
EXT_SOURCE_1 -0.155317 0.155317
DAYS_BIRTH 0.078239 0.078239
REGION_RATING_CLIENT_W_CITY 0.060893 0.060893
REGION_RATING_CLIENT 0.058899 0.058899
DAYS_LAST_PHONE_CHANGE 0.055218 0.055218
DAYS_ID_PUBLISH 0.051457 0.051457
REG_CITY_NOT_WORK_CITY 0.050994 0.050994
FLAG_EMP_PHONE 0.045982 0.045982
Other numerical variables worth considering on domain grounds:
Total income (AMT_INCOME_TOTAL) and credit amount (AMT_CREDIT) both plausibly affect a client's likelihood of repaying a loan.
#Update Numerical Features List to account for Correlation and Logic
selected_num_features = list(top_10_corrs.index)
other_num_features = ['AMT_INCOME_TOTAL','AMT_CREDIT']
selected_num_features.extend(other_num_features)
print("Updated Numerical Features: \n")
for col in selected_num_features:
    print(col)
print(f"\n# of Updated Numerical Features Based on High Correlation: {len(selected_num_features)}")
Updated Numerical Features: 

TARGET
EXT_SOURCE_3
EXT_SOURCE_2
EXT_SOURCE_1
DAYS_BIRTH
REGION_RATING_CLIENT_W_CITY
REGION_RATING_CLIENT
DAYS_LAST_PHONE_CHANGE
DAYS_ID_PUBLISH
REG_CITY_NOT_WORK_CITY
FLAG_EMP_PHONE
AMT_INCOME_TOTAL
AMT_CREDIT

# of Updated Numerical Features Based on High Correlation: 13
#Distribution Plots of highest correlated input variables.
# selected_num_features.remove('TARGET')
cnt_cols = len(selected_num_features)
plt.figure(figsize=(20, 40))
for i, var in enumerate(selected_num_features):
    plt.subplot(cnt_cols, 5, i+1)
    datasets["application_train"][var].hist()
    plt.title(var)
plt.tight_layout()
plt.show()
#Correlation Heatmap of top most correlated variables with Target
# selected_num_features.insert(0, 'TARGET')
selected_num_features_df = train[selected_num_features]
#Correlation Matrix
selected_num_features_cm = selected_num_features_df.corr()
#Plot Correlation Matrix as a heatmap
mask = np.triu(selected_num_features_cm)
plt.figure(figsize=(20,20))
sns.heatmap(selected_num_features_cm, cmap=plt.cm.coolwarm, annot=True, mask=mask )
plt.title("Correlation Heatmap of Top Correlated Features to Target in application_train")
Text(0.5, 1.0, 'Correlation Heatmap of Top Correlated Features to Target in application_train')
# Reference: "https://www.geeksforgeeks.org/sort-correlation-matrix-in-python/".
def get_top_abs_correlations(cm):
    # Keep the upper triangle of the correlation matrix and
    # set the lower triangle (and diagonal) to NaN
    upper_corr_mat = cm.where(np.triu(np.ones(cm.shape), k=1).astype(bool))
    # Convert to a 1-D Series and drop the NaN values
    unique_corr_pairs = upper_corr_mat.unstack().dropna()
    # Sort the pairs by absolute correlation
    sorted_mat = unique_corr_pairs.abs().sort_values()
    return sorted_mat[sorted_mat > 0.7]
top_abs_corrs = pd.DataFrame(get_top_abs_correlations(selected_num_features_cm))
print("Absolute Correlations > 0.7 Pearson Coefficient:")
top_abs_corrs.columns = ['Correlation Factor']
print(top_abs_corrs)
Absolute Correlations > 0.7 Pearson Coefficient:
Correlation Factor
REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY 0.950842
top_abs_corrs['Feature 1 Correlation with Target'] = 0.0
top_abs_corrs['Feature 2 Correlation with Target'] = 0.0
for i, feature in enumerate(top_abs_corrs.index):
    # positional assignment avoids pandas' chained-assignment warning
    top_abs_corrs.iloc[i, 1] = selected_num_features_cm['TARGET'].loc[feature[0]]
    top_abs_corrs.iloc[i, 2] = selected_num_features_cm['TARGET'].loc[feature[1]]
top_abs_corrs
|  |  | Correlation Factor | Feature 1 Correlation with Target | Feature 2 Correlation with Target |
|---|---|---|---|---|
| REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | 0.950842 | 0.058899 | 0.060893 |
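The 0.7 pairwise threshold is one way to flag redundancy; a complementary check is the variance inflation factor (VIF), which measures how well each feature is explained by all the others jointly. A minimal self-contained sketch, using synthetic data as a stand-in for the selected columns (the column names here mirror the pair flagged above but the values are simulated):

```python
import numpy as np
import pandas as pd

def vif(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column: 1 / (1 - R^2), where R^2 comes
    from regressing that column on all remaining columns (plus an intercept)."""
    X = df.to_numpy(dtype=float)
    out = {}
    for i, col in enumerate(df.columns):
        y = X[:, i]
        # least-squares fit of this column on the other columns + intercept
        A = np.column_stack([np.delete(X, i, axis=1), np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - (y - A @ coef).var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

# Synthetic stand-in mimicking the nearly collinear REGION_RATING pair
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "REGION_RATING_CLIENT": base,
    "REGION_RATING_CLIENT_W_CITY": base + 0.1 * rng.normal(size=500),
    "AMT_INCOME_TOTAL": rng.normal(size=500),
})
print(vif(demo).round(1))
```

A VIF far above 10 for the rating pair, versus close to 1 for the independent column, tells the same story as the 0.95 pairwise correlation.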
Among the numerical columns selected from the application training set, some input features were highly correlated with each other. We treated a pair as highly correlated when the absolute Pearson correlation coefficient exceeded 0.7, a commonly used threshold.
Feature selection: for each highly correlated pair, we kept the feature with the stronger correlation to the target, falling back on best judgement when the two were comparable.
Input features to drop:
inputs_to_drop = ['TARGET','REGION_RATING_CLIENT','FLAG_EMP_PHONE']
for input_var in inputs_to_drop:
    selected_num_features.remove(input_var)
print("Updated Numerical Features based on Correlation Accounting for Multicollinearity:".upper())
print("-------------------------------------------------------------------------------------")
for col in selected_num_features:
    print(col)
print(f"\n# of Variables Listed Above: {len(selected_num_features)}")
UPDATED NUMERICAL FEATURES BASED ON CORRELATION ACCOUNTING FOR MULTICOLLINEARITY:
-------------------------------------------------------------------------------------
EXT_SOURCE_3
EXT_SOURCE_2
EXT_SOURCE_1
DAYS_BIRTH
REGION_RATING_CLIENT_W_CITY
DAYS_LAST_PHONE_CHANGE
DAYS_ID_PUBLISH
REG_CITY_NOT_WORK_CITY
AMT_INCOME_TOTAL
AMT_CREDIT

# of Variables Listed Above: 10
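After dropping one feature from each correlated pair, a quick sanity check is to confirm that no remaining pair still exceeds the 0.7 threshold. A small sketch of that check, shown on synthetic data since it only needs a DataFrame of the selected columns:

```python
import numpy as np
import pandas as pd

def max_abs_pairwise_corr(df):
    """Largest absolute off-diagonal correlation among the columns of df."""
    cm = df.corr().abs()
    # keep only the strict upper triangle so each pair is counted once
    upper = cm.where(np.triu(np.ones(cm.shape, dtype=bool), k=1))
    return upper.stack().max()

# Synthetic stand-in for train[selected_num_features]
rng = np.random.default_rng(42)
demo = pd.DataFrame(
    rng.normal(size=(300, 4)),
    columns=["EXT_SOURCE_3", "DAYS_BIRTH", "AMT_INCOME_TOTAL", "AMT_CREDIT"],
)
print(f"max |corr| among selected features: {max_abs_pairwise_corr(demo):.3f}")
```

On the real data this would be run as `max_abs_pairwise_corr(train[selected_num_features])`, and the result should stay below 0.7.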
selected_cat_features = []
for col in train:
    if train[col].dtype == 'object':
        selected_cat_features.append(col)
#Print Categorical Features
print("Categorical Features:")
print("---------------------")
for col in selected_cat_features:
    print(col)
selected_cat_features_len = len(selected_cat_features)
print(f"\n# of Categorical Features: {selected_cat_features_len}\n")
Categorical Features:
---------------------
NAME_CONTRACT_TYPE
CODE_GENDER
FLAG_OWN_CAR
FLAG_OWN_REALTY
NAME_TYPE_SUITE
NAME_INCOME_TYPE
NAME_EDUCATION_TYPE
NAME_FAMILY_STATUS
NAME_HOUSING_TYPE
OCCUPATION_TYPE
WEEKDAY_APPR_PROCESS_START
ORGANIZATION_TYPE
FONDKAPREMONT_MODE
HOUSETYPE_MODE
WALLSMATERIAL_MODE
EMERGENCYSTATE_MODE

# of Categorical Features: 16
import math
fig_rows = math.ceil(selected_cat_features_len/2)
fig, ax = plt.subplots(fig_rows, 2, figsize=(20, 50))
for idx, cat in enumerate(selected_cat_features):
    plt.subplot(fig_rows, 2, idx+1)
    sns.countplot(x=cat, hue='TARGET', data=train)
    plt.title(f"Distribution of Variable: {cat}")
    plt.xticks(rotation=90)
plt.tight_layout()
Based on the above histograms, the following categorical variables will be dropped:
inputs_to_drop = ['NAME_TYPE_SUITE', 'NAME_HOUSING_TYPE', 'WEEKDAY_APPR_PROCESS_START',
'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE','ORGANIZATION_TYPE','NAME_INCOME_TYPE']
for input_var in inputs_to_drop:
    selected_cat_features.remove(input_var)
print("Updated Categorical Columns based on Histograms of Distributions".upper())
print("-------------------------------------------------------------------------------------")
for col in selected_cat_features:
    print(col)
print(f"\n# of Variables Listed Above: {len(selected_cat_features)}")
UPDATED CATEGORICAL COLUMNS BASED ON HISTOGRAMS OF DISTRIBUTIONS
-------------------------------------------------------------------------------------
NAME_CONTRACT_TYPE
CODE_GENDER
FLAG_OWN_CAR
FLAG_OWN_REALTY
NAME_EDUCATION_TYPE
NAME_FAMILY_STATUS
OCCUPATION_TYPE

# of Variables Listed Above: 7
print("Final Features Selected from Application Training Set: ")
print()
print('NUMERICAL FEATURES: ')
print('----------------------')
for col in selected_num_features:
    print(col)
print(f"\n# of Variables Listed Above: {len(selected_num_features)}")
print()
print()
print('CATEGORICAL FEATURES: ')
print('----------------------')
for col in selected_cat_features:
    print(col)
print(f"\n# of Variables Listed Above: {len(selected_cat_features)}")
Final Features Selected from Application Training Set: 

NUMERICAL FEATURES: 
----------------------
EXT_SOURCE_3
EXT_SOURCE_2
EXT_SOURCE_1
DAYS_BIRTH
REGION_RATING_CLIENT_W_CITY
DAYS_LAST_PHONE_CHANGE
DAYS_ID_PUBLISH
REG_CITY_NOT_WORK_CITY
AMT_INCOME_TOTAL
AMT_CREDIT

# of Variables Listed Above: 10


CATEGORICAL FEATURES: 
----------------------
NAME_CONTRACT_TYPE
CODE_GENDER
FLAG_OWN_CAR
FLAG_OWN_REALTY
NAME_EDUCATION_TYPE
NAME_FAMILY_STATUS
OCCUPATION_TYPE

# of Variables Listed Above: 7
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from time import time
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
#Table to track experimental results
try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["Experiment Number",
                                   "Model",
                                   "# Transformed Input Features",
                                   "# Original Numerical Features",
                                   "# Original Categorical Features",
                                   "Train Acc",
                                   "Valid Acc",
                                   "Test Acc",
                                   "Train F1",
                                   "Valid F1",
                                   "Test F1",
                                   "Train AUROC",
                                   "Valid AUROC",
                                   "Test AUROC",
                                   "Training Time",
                                   "Training Prediction Time",
                                   "Validation Prediction Time",
                                   "Test Prediction Time",
                                   "Hyperparameters",
                                   "Best Parameter",
                                   "Best Hypertuning Score",
                                   "Description"])
display(expLog)
| Experiment Number | Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 22 columns
# Function to train models
def train_model(df, exp_name, num_features, cat_features, pipeline):
    features = num_features + cat_features
    # Split data into train, validation, and test sets
    y = df['TARGET']
    X = df[features]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
    print(f"X train shape: {X_train.shape}")
    print(f"X validation shape: {X_valid.shape}")
    print(f"X test shape: {X_test.shape}")
    print(f"\nPERFORMING TRAINING: {exp_name}")
    print("\tPipeline:", [name for name, _ in pipeline.steps])
    print("\t# Total Features: ", len(features))
    print("\nNumerical Features:")
    print(num_features)
    print("\t# Numerical Features: ", len(num_features))
    print("\nCategorical Features:")
    print(cat_features)
    print("\t# Categorical Features: ", len(cat_features))
    print('\ntraining in progress...')
    # Fit the pipeline to the training data
    start = time()
    model = pipeline.fit(X_train, y_train)
    train_time = np.round(time() - start, 4)
    print(f"\nBaseline Experiment with Original {len(features)} Input Variables - Training Time: {train_time:.3f}s")
    return features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time
from sklearn.metrics import confusion_matrix
# Function to predict and score trained models
def predict_and_score(X, y, model, model_ID):
    start = time()
    y_pred = model.predict(X)
    pred_time = time() - start
    print("\tPrediction Time: %0.3fs" % pred_time)
    acc = accuracy_score(y, y_pred)
    print("\tAccuracy Score: ", acc)
    f1 = f1_score(y, y_pred)
    print("\tF1 Score: ", f1)
    auroc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    print("\tAUROC Score: ", auroc)
    print("\tConfusion Matrix:")
    class_labels = ["0: Repaid", "1: Not Repaid"]
    cm = confusion_matrix(y, y_pred).astype(np.float32)
    cm /= cm.sum(axis=1)[:, np.newaxis]  # row-normalize so each true class sums to 1
    cm_plot = sns.heatmap(cm, vmin=0, vmax=1, annot=True, cmap="Reds")
    plt.xlabel("Predicted", fontsize=13)
    plt.ylabel("True", fontsize=13)
    cm_plot.set(xticklabels=class_labels, yticklabels=class_labels)
    plt.title(model_ID, fontsize=13)
    plt.show()
    return (cm, y_pred, pred_time, acc, f1, auroc)
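The row normalization in `predict_and_score` converts raw counts into per-true-class rates, so the diagonal of the heatmap shows each class's recall. A quick numeric sketch with a made-up 2×2 matrix:

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([[90., 10.],    # 100 samples of class 0
               [ 6.,  4.]])   # 10 samples of class 1
cm_norm = cm / cm.sum(axis=1)[:, np.newaxis]  # each row now sums to 1
print(cm_norm)  # [[0.9 0.1]
                #  [0.6 0.4]]
```

The 0.9 and 0.4 on the diagonal are the recalls of class 0 and class 1, which is why a heavily imbalanced model can look good on accuracy while the bottom row of the normalized matrix stays poor.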
Train, validation, and test sets (and the leakage problem we mentioned previously):
Let's look at a small use case that shows how to deal with this:
Fitting a OneHotEncoder on the training set and then transforming the test set raises a ValueError whenever the test set contains new, previously unseen category values: the encoder doesn't know how to handle them. In order to use both the transformed training and test sets in machine learning algorithms, we need them to have the same number of columns. This problem is solved by the `handle_unknown='ignore'` option of the OneHotEncoder, which, as the name suggests, ignores previously unseen values when transforming the test set.
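A minimal reproduction of both behaviors, using toy category values rather than the actual HCDR columns:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_vals = np.array([['Cash loans'], ['Revolving loans']])
test_vals = np.array([['Cash loans'], ['Unseen product']])  # second value never seen in training

# Default handle_unknown='error': transforming an unseen category raises ValueError
enc = OneHotEncoder().fit(train_vals)
try:
    enc.transform(test_vals)
except ValueError as e:
    print('ValueError:', e)

# handle_unknown='ignore': an unseen category encodes as an all-zero row,
# so train and test keep the same number of columns
enc_safe = OneHotEncoder(handle_unknown='ignore').fit(train_vals)
print(enc_safe.transform(test_vals).toarray())  # [[1. 0.]
                                                #  [0. 0.]]
```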
Here is an example of that in action:
# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE',
               'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE']
# Notice handle_unknown="ignore" in the OHE, which ignores values in the validation/test sets
# that do NOT occur in the training set
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
# Pipeline for the numeric features.
# StandardScaler() standardizes the data; missing values are imputed using the feature mean.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler(with_mean=False))
])
# Pipeline for the categorical features.
# Entries with missing values, or values not seen during fitting, are one-hot encoded as zeroes.
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# features_pipeline to combine numerical and categorical pipelines
data_pipeline_17 = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, selected_num_features),
        ('cat', cat_pipeline, selected_cat_features)],
    remainder='drop',
    n_jobs=-1
)
# Baseline Experiment
baseline_pipeline_17 = Pipeline([
    ("preparation", data_pipeline_17),
    ("logRegression", LogisticRegression())
])
#Name of Experiment
exp_name = "Baseline 1, LogReg with Original 17 Selected Features"
#Description of Experiments
description = 'Baseline 1 LogReg Model with Preselected Num and Cat Features.'
#Start Experiment count for the expLog
exp_count = 1
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, selected_num_features, selected_cat_features, baseline_pipeline_17)
X train shape: (209107, 17)
X validation shape: (52277, 17)
X test shape: (46127, 17)
PERFORMING TRAINING: Baseline 1, LogReg with Original 17 Selected Features
Pipeline: ['preparation', 'logRegression']
# Total Features: 17
Numerical Features:
['EXT_SOURCE_3', 'EXT_SOURCE_2', 'EXT_SOURCE_1', 'DAYS_BIRTH', 'REGION_RATING_CLIENT_W_CITY', 'DAYS_LAST_PHONE_CHANGE', 'DAYS_ID_PUBLISH', 'REG_CITY_NOT_WORK_CITY', 'AMT_INCOME_TOTAL', 'AMT_CREDIT']
# Numerical Features: 10
Categorical Features:
['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE']
# Categorical Features: 7
training in progress...
Baseline Experiment with Original 17 Input Variables - Training Time: 2.935s
X_train_transformed_17 = data_pipeline_17.fit_transform(X_train)
total_inputs_17 = X_train_transformed_17.shape[1]
# Training Set
print(f"Baseline Experiment with {total_inputs_17} Variables - Training Set:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc= predict_and_score(X_train, y_train, model, exp_name+' - Training Set')
# Validation Set
print(f"Baseline Experiment with {total_inputs_17} Variables - Validation Set:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')
# Test Set
print(f"Baseline Experiment with {total_inputs_17} Variables - Test Set:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc= predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
Baseline Experiment with 49 Variables - Training Set:
    Prediction Time: 0.713s
    Accuracy Score: 0.9198352996312893
    F1 Score: 0.013534984993820987
    AUROC Score: 0.7372722139860943
    Confusion Matrix:
Baseline Experiment with 49 Variables - Validation Set:
    Prediction Time: 0.439s
    Accuracy Score: 0.9164068328327946
    F1 Score: 0.015322217214961693
    AUROC Score: 0.7379010540956517
    Confusion Matrix:
Baseline Experiment with 49 Variables - Test Set:
    Prediction Time: 0.189s
    Accuracy Score: 0.9190062219524356
    F1 Score: 0.01059322033898305
    AUROC Score: 0.7367969785452071
    Confusion Matrix:
expLog.loc[len(expLog)] = [exp_count,
                           exp_name,
                           total_inputs_17,
                           len(selected_num_features),
                           len(selected_cat_features),
                           round(train_acc, 3),
                           round(valid_acc, 3),
                           round(test_acc, 3),
                           round(train_f1, 3),
                           round(valid_f1, 3),
                           round(test_f1, 3),
                           round(train_auroc, 3),
                           round(valid_auroc, 3),
                           round(test_auroc, 3),
                           train_time,
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test,
                           "N/A",
                           "N/A",
                           "N/A",
                           description]
display(expLog)
exp_count += 1
| Experiment Number | Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Baseline 1, LogReg with Original 17 Selected F... | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.015 | ... | 0.738 | 0.737 | 2.9348 | 0.713401 | 0.438845 | 0.189358 | N/A | N/A | N/A | Baseline 1 LogReg Model with Preselected Num a... |
1 rows × 22 columns
# Input features, excluding SK_ID_CURR and TARGET
all_num_features = train.describe().columns.to_list()
all_cat_features = list(set(train.columns.to_list()) - set(all_num_features))
all_num_features.remove('SK_ID_CURR')  # ID has no effect on ability to repay loans
all_num_features.remove('TARGET')
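One caveat with the cell above: a set difference returns the categorical columns in arbitrary order, which can shuffle the transformed column layout between runs. A deterministic alternative (sketched on a toy frame, not the actual `train` DataFrame) uses `select_dtypes`:

```python
import pandas as pd

# Toy stand-in for the application table
df = pd.DataFrame({
    'SK_ID_CURR': [1, 2],
    'TARGET': [0, 1],
    'AMT_CREDIT': [540000.0, 161730.0],
    'CODE_GENDER': ['M', 'F'],
})

# Numeric columns minus the ID and label; dtype-based and order-stable
num_features = [c for c in df.select_dtypes(include='number').columns
                if c not in ('SK_ID_CURR', 'TARGET')]
cat_features = df.select_dtypes(exclude='number').columns.tolist()
print(num_features)  # ['AMT_CREDIT']
print(cat_features)  # ['CODE_GENDER']
```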
# features_pipeline to combine numerical and categorical pipelines of all features
data_pipeline_120 = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, all_num_features),
        ('cat', cat_pipeline, all_cat_features)],
    remainder='drop',
    n_jobs=-1
)
# Baseline Experiment with 120 Input Vars
baseline_pipeline_120 = Pipeline([
    ("preparation", data_pipeline_120),
    ("logRegression", LogisticRegression())
])
#Name of Experiment
exp_name = "Baseline 2, LogReg with original 120 Features"
#Description of Experiments
description = 'Baseline 2 LogReg Model with Num and Cat Features.'
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time= train_model(train, exp_name, all_num_features, all_cat_features, baseline_pipeline_120)
X train shape: (209107, 120)
X validation shape: (52277, 120)
X test shape: (46127, 120)
PERFORMING TRAINING: Baseline 2, LogReg with original 120 Features
Pipeline: ['preparation', 'logRegression']
# Total Features: 120
Numerical Features:
['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 
'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
# Numerical Features: 104
Categorical Features:
['FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'ORGANIZATION_TYPE', 'HOUSETYPE_MODE', 'NAME_EDUCATION_TYPE', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'WEEKDAY_APPR_PROCESS_START', 'OCCUPATION_TYPE', 'NAME_HOUSING_TYPE', 'EMERGENCYSTATE_MODE', 'FONDKAPREMONT_MODE', 'NAME_FAMILY_STATUS', 'NAME_INCOME_TYPE', 'FLAG_OWN_CAR', 'WALLSMATERIAL_MODE']
# Categorical Features: 16
training in progress...
Baseline Experiment with Original 120 Input Variables - Training Time: 4.948s
X_train_transformed_120 = data_pipeline_120.fit_transform(X_train)
total_inputs_120 = X_train_transformed_120.shape[1]
# Training Set
print(f"Baseline Experiment Training Set with {total_inputs_120} Input Features:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')
# Validation Set
print(f"Baseline Experiment Validation Set with {total_inputs_120} Input Features:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')
# Test Set
print(f"Baseline Experiment Test Set with {total_inputs_120} Input Features:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
Baseline Experiment Training Set with 250 Input Features:
    Prediction Time: 1.188s
    Accuracy Score: 0.9199548556480653
    F1 Score: 0.021512919443470127
    AUROC Score: 0.745932186550156
    Confusion Matrix:
Baseline Experiment Validation Set with 250 Input Features:
    Prediction Time: 0.360s
    Accuracy Score: 0.9163303173479733
    F1 Score: 0.020161290322580648
    AUROC Score: 0.7463851151424112
    Confusion Matrix:
Baseline Experiment Test Set with 250 Input Features:
    Prediction Time: 0.321s
    Accuracy Score: 0.9193314111041255
    F1 Score: 0.024127983215316024
    AUROC Score: 0.7429677570702033
    Confusion Matrix:
expLog.loc[len(expLog)] = [exp_count,
                           exp_name,
                           total_inputs_120,
                           len(all_num_features),
                           len(all_cat_features),
                           round(train_acc, 3),
                           round(valid_acc, 3),
                           round(test_acc, 3),
                           round(train_f1, 3),
                           round(valid_f1, 3),
                           round(test_f1, 3),
                           round(train_auroc, 3),
                           round(valid_auroc, 3),
                           round(test_auroc, 3),
                           train_time,
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test,
                           "N/A",
                           "N/A",
                           "N/A",
                           description]
display(expLog)
exp_count += 1
| Experiment Number | Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Baseline 1, LogReg with Original 17 Selected F... | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.015 | ... | 0.738 | 0.737 | 2.9348 | 0.713401 | 0.438845 | 0.189358 | N/A | N/A | N/A | Baseline 1 LogReg Model with Preselected Num a... |
| 1 | 2 | Baseline 2, LogReg with original 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.022 | 0.020 | ... | 0.746 | 0.743 | 4.9481 | 1.188051 | 0.359550 | 0.321185 | N/A | N/A | N/A | Baseline 2 LogReg Model with Num and Cat Featu... |
2 rows × 22 columns
# LogReg Experiment with L1 Penalty (L2 is the default)
L1_pipeline_17 = Pipeline([
    ("preparation", data_pipeline_17),
    ("lassoRegression", LogisticRegression(penalty='l1', solver='saga'))
])
#Name of Experiment
exp_name = "LogReg - L1 Penalty with Selected 17 Features"
#Description of Experiments
description = 'LogReg Model-L1 Penalty with Selected 17 Cat + Num Features.'
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, selected_num_features, selected_cat_features, L1_pipeline_17)
X train shape: (209107, 17)
X validation shape: (52277, 17)
X test shape: (46127, 17)
PERFORMING TRAINING: LogReg - L1 Penalty with Selected 17 Features
Pipeline: ['preparation', 'lassoRegression']
# Total Features: 17
Numerical Features:
['EXT_SOURCE_3', 'EXT_SOURCE_2', 'EXT_SOURCE_1', 'DAYS_BIRTH', 'REGION_RATING_CLIENT_W_CITY', 'DAYS_LAST_PHONE_CHANGE', 'DAYS_ID_PUBLISH', 'REG_CITY_NOT_WORK_CITY', 'AMT_INCOME_TOTAL', 'AMT_CREDIT']
# Numerical Features: 10
Categorical Features:
['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE']
# Categorical Features: 7
training in progress...
Baseline Experiment with Original 17 Input Variables - Training Time: 15.145s
# Training Set
print(f"LogReg - L1 Penalty Training Set with {total_inputs_17} Input Features:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')
# Validation Set
print(f"LogReg - L1 Penalty Validation Set with {total_inputs_17} Input Features:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')
# Test Set
print(f"LogReg - L1 Penalty Test Set with {total_inputs_17} Input Features:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
LogReg - L1 Penalty Training Set with 49 Input Features:
    Prediction Time: 0.468s
    Accuracy Score: 0.9198400818719603
    F1 Score: 0.0136518771331058
    AUROC Score: 0.7372155400141124
    Confusion Matrix:
LogReg - L1 Penalty Validation Set with 49 Input Features:
    Prediction Time: 0.210s
    Accuracy Score: 0.9164450905752052
    F1 Score: 0.016216216216216217
    AUROC Score: 0.7379875452540834
    Confusion Matrix:
LogReg - L1 Penalty Test Set with 49 Input Features:
    Prediction Time: 0.191s
    Accuracy Score: 0.919027901229215
    F1 Score: 0.011119936457505957
    AUROC Score: 0.7369082524977961
    Confusion Matrix:
expLog.loc[len(expLog)] = [exp_count,
                           exp_name,
                           total_inputs_17,
                           len(selected_num_features),
                           len(selected_cat_features),
                           round(train_acc, 3),
                           round(valid_acc, 3),
                           round(test_acc, 3),
                           round(train_f1, 3),
                           round(valid_f1, 3),
                           round(test_f1, 3),
                           round(train_auroc, 3),
                           round(valid_auroc, 3),
                           round(test_auroc, 3),
                           train_time,
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test,
                           "N/A",
                           "N/A",
                           "N/A",
                           description]
display(expLog)
exp_count += 1
| Experiment Number | Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Baseline 1, LogReg with Original 17 Selected F... | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.015 | ... | 0.738 | 0.737 | 2.9348 | 0.713401 | 0.438845 | 0.189358 | N/A | N/A | N/A | Baseline 1 LogReg Model with Preselected Num a... |
| 1 | 2 | Baseline 2, LogReg with original 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.022 | 0.020 | ... | 0.746 | 0.743 | 4.9481 | 1.188051 | 0.359550 | 0.321185 | N/A | N/A | N/A | Baseline 2 LogReg Model with Num and Cat Featu... |
| 2 | 3 | LogReg - L1 Penalty with Selected 17 Features | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 15.1447 | 0.467919 | 0.210027 | 0.190774 | N/A | N/A | N/A | LogReg Model-L1 Penalty with Selected 17 Cat +... |
3 rows × 22 columns
# LogReg Experiment with L1 Penalty (L2 is the default)
L1_pipeline_120 = Pipeline([
    ("preparation", data_pipeline_120),
    ("lassoRegression", LogisticRegression(penalty='l1', solver='saga'))
])
#Name of Experiment
exp_name = "LogReg - L1 Penalty with 120 Features"
#Description of Experiments
description = 'LogReg Model-L1 Penalty with 104 Num + 16 Cat Features.'
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, all_num_features, all_cat_features, L1_pipeline_120)
X train shape: (209107, 120)
X validation shape: (52277, 120)
X test shape: (46127, 120)
PERFORMING TRAINING: LogReg - L1 Penalty with 120 Features
Pipeline: ['preparation', 'lassoRegression']
# Total Features: 120
Numerical Features:
['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 
'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
# Numerical Features: 104
Categorical Features:
['FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'ORGANIZATION_TYPE', 'HOUSETYPE_MODE', 'NAME_EDUCATION_TYPE', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'WEEKDAY_APPR_PROCESS_START', 'OCCUPATION_TYPE', 'NAME_HOUSING_TYPE', 'EMERGENCYSTATE_MODE', 'FONDKAPREMONT_MODE', 'NAME_FAMILY_STATUS', 'NAME_INCOME_TYPE', 'FLAG_OWN_CAR', 'WALLSMATERIAL_MODE']
# Categorical Features: 16
training in progress...
Baseline Experiment with Original 120 Input Variables - Training Time: 60.468s
# Training Set
print(f"LogReg - L1 Penalty Training Set with {total_inputs_120} Input Features:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')
# Validation Set
print(f"LogReg - L1 Penalty Validation Set with {total_inputs_120} Input Features:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')
# Test Set
print(f"LogReg - L1 Penalty Test Set with {total_inputs_120} Input Features:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
LogReg - L1 Penalty Training Set with 250 Input Features:
    Prediction Time: 1.212s
    Accuracy Score: 0.9198735575566576
    F1 Score: 0.016898433374405917
    AUROC Score: 0.7440194523416428
    Confusion Matrix:
LogReg - L1 Penalty Validation Set with 250 Input Features:
    Prediction Time: 0.351s
    Accuracy Score: 0.916311188476768
    F1 Score: 0.013528748590755355
    AUROC Score: 0.745327526606072
    Confusion Matrix:
LogReg - L1 Penalty Test Set with 250 Input Features:
    Prediction Time: 0.323s
    Accuracy Score: 0.9192013354434496
    F1 Score: 0.017918313570487485
    AUROC Score: 0.7427405148645342
    Confusion Matrix:
expLog.loc[len(expLog)] = [exp_count,
                           exp_name,
                           total_inputs_120,
                           len(all_num_features),
                           len(all_cat_features),
                           round(train_acc, 3),
                           round(valid_acc, 3),
                           round(test_acc, 3),
                           round(train_f1, 3),
                           round(valid_f1, 3),
                           round(test_f1, 3),
                           round(train_auroc, 3),
                           round(valid_auroc, 3),
                           round(test_auroc, 3),
                           train_time,
                           pred_time_train,
                           pred_time_valid,
                           pred_time_test,
                           "N/A",
                           "N/A",
                           "N/A",
                           description]
display(expLog)
exp_count += 1
| Experiment Number | Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Baseline 1, LogReg with Original 17 Selected F... | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.015 | ... | 0.738 | 0.737 | 2.9348 | 0.713401 | 0.438845 | 0.189358 | N/A | N/A | N/A | Baseline 1 LogReg Model with Preselected Num a... |
| 1 | 2 | Baseline 2, LogReg with original 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.022 | 0.020 | ... | 0.746 | 0.743 | 4.9481 | 1.188051 | 0.359550 | 0.321185 | N/A | N/A | N/A | Baseline 2 LogReg Model with Num and Cat Featu... |
| 2 | 3 | LogReg - L1 Penalty with Selected 17 Features | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 15.1447 | 0.467919 | 0.210027 | 0.190774 | N/A | N/A | N/A | LogReg Model-L1 Penalty with Selected 17 Cat +... |
| 3 | 4 | LogReg - L1 Penalty with 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.017 | 0.014 | ... | 0.745 | 0.743 | 60.4684 | 1.212351 | 0.351223 | 0.323352 | N/A | N/A | N/A | LogReg Model-L1 Penalty with 104 Num + 16 Cat ... |
4 rows × 22 columns
# All 120 Input Features Plus New Feature Transformation in Pipeline: Debt-to-Income Ratio
from sklearn.base import BaseEstimator, TransformerMixin

class Debt_to_Income_Ratio(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):  # no *args or **kwargs
        self.features = features
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        # Select a subset of columns in X based on self.features
        df = pd.DataFrame(X.copy(), columns=self.features)
        feature1 = 'AMT_CREDIT'
        feature2 = 'AMT_INCOME_TOTAL'
        # Create new column for the debt-to-income ratio
        df['DEBT_TO_INCOME_RATIO'] = df[feature1] / df[feature2]
        # Drop the features the ratio was built from
        df.drop(feature1, axis=1, inplace=True)
        df.drop(feature2, axis=1, inplace=True)
        return df
test_pipeline = make_pipeline(Debt_to_Income_Ratio())
debt_income_ratio = test_pipeline.fit_transform(X_train[['AMT_CREDIT', 'AMT_INCOME_TOTAL']])
display(pd.DataFrame(np.c_[X_train[['AMT_CREDIT', 'AMT_INCOME_TOTAL']], debt_income_ratio],
                     columns=['AMT_CREDIT', 'AMT_INCOME_TOTAL', 'DEBT_INCOME_RATIO']))
| AMT_CREDIT | AMT_INCOME_TOTAL | DEBT_INCOME_RATIO | |
|---|---|---|---|
| 0 | 540000.0 | 144000.0 | 3.750000 |
| 1 | 1762110.0 | 225000.0 | 7.831600 |
| 2 | 161730.0 | 135000.0 | 1.198000 |
| 3 | 270000.0 | 67500.0 | 4.000000 |
| 4 | 1381113.0 | 202500.0 | 6.820311 |
| ... | ... | ... | ... |
| 209102 | 1762110.0 | 270000.0 | 6.526333 |
| 209103 | 284400.0 | 112500.0 | 2.528000 |
| 209104 | 180000.0 | 45000.0 | 4.000000 |
| 209105 | 1736937.0 | 202500.0 | 8.577467 |
| 209106 | 157500.0 | 58500.0 | 2.692308 |
209107 rows × 3 columns
data_pipeline_DIR_120 = ColumnTransformer(
    transformers=[
        # (name, transformer, columns)
        ('num', num_pipeline, all_num_features),
        ('cat', cat_pipeline, all_cat_features),
        ('DIR', make_pipeline(Debt_to_Income_Ratio(), StandardScaler()), ['AMT_CREDIT', 'AMT_INCOME_TOTAL'])
    ],
    remainder='drop',
    n_jobs=-1
)
X_train_transformed = data_pipeline_DIR_120.fit_transform(X_train)
column_names = all_num_features + \
list(data_pipeline_DIR_120.transformers_[1][1].named_steps["onehot"].get_feature_names(all_cat_features)) +\
['DEBT_TO_INCOME_RATIO']
display(pd.DataFrame(X_train_transformed, columns=column_names).head())
number_of_inputs = X_train_transformed.shape[1]
| CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | ... | FLAG_OWN_CAR_Y | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | WALLSMATERIAL_MODE_missing | DEBT_TO_INCOME_RATIO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.763729 | 1.355273 | 1.339199 | 2.022258 | 1.459046 | 1.496970 | -2.541537 | -0.020777 | -1.388585 | -1.504414 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -0.077059 |
| 1 | 1.381865 | 2.117614 | 4.370030 | 3.208587 | 4.255552 | 1.778614 | -2.978958 | -0.021845 | -0.473084 | -2.666495 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.437253 |
| 2 | 0.000000 | 1.270568 | 0.401090 | 0.785916 | 0.364762 | 0.733344 | -5.520724 | 2.583799 | -3.407287 | -2.883019 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -1.023874 |
| 3 | 0.000000 | 0.635284 | 0.669600 | 0.961737 | 0.729523 | 2.222725 | -4.295578 | -0.034366 | -2.235083 | -1.503752 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.015694 |
| 4 | 0.000000 | 1.905853 | 3.425158 | 2.630799 | 3.258537 | 2.222725 | -2.070415 | -0.010463 | -1.092978 | -0.992569 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.062055 |
5 rows × 251 columns
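Note that `OneHotEncoder.get_feature_names`, used above to recover the expanded column names, was removed in scikit-learn 1.2 in favor of `get_feature_names_out`. A version-tolerant sketch (toy `CODE_GENDER` values, not the fitted pipeline above):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore').fit(np.array([['M'], ['F']]))

# Prefer the newer API when available, falling back to the old one
if hasattr(enc, 'get_feature_names_out'):
    names = list(enc.get_feature_names_out(['CODE_GENDER']))
else:
    names = list(enc.get_feature_names(['CODE_GENDER']))
print(names)  # ['CODE_GENDER_F', 'CODE_GENDER_M']
```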
# Baseline Experiment with 120 Input Vars + Debt-to-Income Ratio
baseline_pipeline_DIR_120 = Pipeline([
    ("preparation", data_pipeline_DIR_120),
    ("logRegression", LogisticRegression())
])
# Name of Experiment
exp_name = "LogReg with Num and Cat Features + Debt_Income_Ratio"
# Description of Experiments
description = "Logistic Regression Model with Original 120 Num and Cat Features + Debt-Income-Ratio."
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, all_num_features, all_cat_features, baseline_pipeline_DIR_120)
X train shape: (209107, 120)
X validation shape: (52277, 120)
X test shape: (46127, 120)
PERFORMING TRAINING: LogReg with Num and Cat Features + Debt_Income_Ratio
Pipeline: ['preparation', 'logRegression']
# Total Features: 120
Numerical Features:
['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 
'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
# Numerical Features: 104
Categorical Features:
['FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'ORGANIZATION_TYPE', 'HOUSETYPE_MODE', 'NAME_EDUCATION_TYPE', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'WEEKDAY_APPR_PROCESS_START', 'OCCUPATION_TYPE', 'NAME_HOUSING_TYPE', 'EMERGENCYSTATE_MODE', 'FONDKAPREMONT_MODE', 'NAME_FAMILY_STATUS', 'NAME_INCOME_TYPE', 'FLAG_OWN_CAR', 'WALLSMATERIAL_MODE']
# Categorical Features: 16
training in progress...
Baseline Experiment with Original 120 Input Variables - Training Time: 5.076s
total_inputs = X_train_transformed.shape[1]
# Training Set
print(f"Training Set with all 120 input features + Added Debt-Income-Ratio Feature:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')
# Validation Set
print(f"Validation Set with all 120 input features + Added Debt-Income-Ratio Feature:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')
# Test Set
print(f"Test Set with all 120 input features + Added Debt-Income-Ratio Feature:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
Training Set with all 120 input features + Added Debt-Income-Ratio Feature:
Prediction Time: 1.156s
Accuracy Score: 0.9198352996312893
F1 Score: 0.020566754309085597
AUROC Score: 0.7453773395654856
Confusion Matrix:
Validation Set with all 120 input features + Added Debt-Income-Ratio Feature:
Prediction Time: 0.350s
Accuracy Score: 0.9161198997647149
F1 Score: 0.01791713325867861
AUROC Score: 0.7456796517828043
Confusion Matrix:
Test Set with all 120 input features + Added Debt-Income-Ratio Feature:
Prediction Time: 0.326s
Accuracy Score: 0.9192013354434496
F1 Score: 0.02255441909257802
AUROC Score: 0.7430448006911027
Confusion Matrix:
expLog.loc[len(expLog)] = [exp_count,
exp_name,
total_inputs,
len(all_num_features),
len(all_cat_features),
round(train_acc, 3),
round(valid_acc, 3),
round(test_acc, 3),
round(train_f1, 3),
round(valid_f1, 3),
round(test_f1, 3),
round(train_auroc, 3),
round(valid_auroc, 3),
round(test_auroc,3),
train_time,
pred_time_train,
pred_time_valid,
pred_time_test,
"N/A",
"N/A",
"N/A",
description]
display(expLog)
exp_count += 1
| | Experiment Number | Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Baseline 1, LogReg with Original 17 Selected F... | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.015 | ... | 0.738 | 0.737 | 2.9348 | 0.713401 | 0.438845 | 0.189358 | N/A | N/A | N/A | Baseline 1 LogReg Model with Preselected Num a... |
| 1 | 2 | Baseline 2, LogReg with original 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.022 | 0.020 | ... | 0.746 | 0.743 | 4.9481 | 1.188051 | 0.359550 | 0.321185 | N/A | N/A | N/A | Baseline 2 LogReg Model with Num and Cat Featu... |
| 2 | 3 | LogReg - L1 Penalty with Selected 17 Features | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 15.1447 | 0.467919 | 0.210027 | 0.190774 | N/A | N/A | N/A | LogReg Model-L1 Penalty with Selected 17 Cat +... |
| 3 | 4 | LogReg - L1 Penalty with 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.017 | 0.014 | ... | 0.745 | 0.743 | 60.4684 | 1.212351 | 0.351223 | 0.323352 | N/A | N/A | N/A | LogReg Model-L1 Penalty with 104 Num + 16 Cat ... |
| 4 | 5 | LogReg with Num and Cat Features + Debt_Income... | 251 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.021 | 0.018 | ... | 0.746 | 0.743 | 5.0762 | 1.155572 | 0.350079 | 0.325508 | N/A | N/A | N/A | Logistic Regression Model with Original 120 N... |
5 rows × 22 columns
data_pipeline_DIR_17 = ColumnTransformer(
transformers= [
# (name, transformer, columns)
('num', num_pipeline, selected_num_features),
('cat', cat_pipeline, selected_cat_features),
('DIR', make_pipeline(Debt_to_Income_Ratio(), StandardScaler()), ['AMT_CREDIT', 'AMT_INCOME_TOTAL'])
],
remainder='drop',
n_jobs=-1
)
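The custom `Debt_to_Income_Ratio` transformer referenced in the `'DIR'` step is defined earlier in the notebook; as a reminder of the idea, here is a minimal sketch of what such a transformer could look like, assuming it receives exactly the two columns routed to it above (the NaN guard for zero income is an illustrative choice, not necessarily the original implementation):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Debt_to_Income_Ratio(BaseEstimator, TransformerMixin):
    """Derive the credit-to-income ratio from the two columns routed to this step."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Works whether X arrives as a DataFrame slice or a plain 2-column array
        X = pd.DataFrame(X, columns=['AMT_CREDIT', 'AMT_INCOME_TOTAL'])
        # Guard against zero income so the downstream StandardScaler sees NaN, not inf
        ratio = X['AMT_CREDIT'] / X['AMT_INCOME_TOTAL'].replace(0, np.nan)
        return ratio.to_numpy().reshape(-1, 1)
```

The `StandardScaler` that follows it in `make_pipeline` expects 2-D input, hence the `reshape(-1, 1)` on the single derived column.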
baseline_pipeline_DIR_17 = Pipeline([
("preparation", data_pipeline_DIR_17),
("logRegression", LogisticRegression())
])
X_train_transformed = data_pipeline_DIR_17.fit_transform(X_train)
total_inputs = X_train_transformed.shape[1]
# Name of Experiment
exp_name = "LogReg with Num and Cat Features + Debt_Income_Ratio"
# Description of Experiment
description = "Logistic Regression Model with Original 17 Num and Cat Features + Debt-Income-Ratio."
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, all_num_features, all_cat_features, baseline_pipeline_DIR_17)
X train shape: (209107, 120)
X validation shape: (52277, 120)
X test shape: (46127, 120)
PERFORMING TRAINING: {exp_name}
Pipeline: ['preparation', 'logRegression']
# Total Features: 120
Numerical Features: (same 104 features as listed for the previous experiment)
# Numerical Features: 104
Categorical Features: (same 16 features as listed for the previous experiment)
# Categorical Features: 16
training in progress...
Baseline Experiment with Original 120 Input Variables - Training Time: 2.117s
# Training Set
print(f"Training Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')
# Validation Set
print(f"Validation Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')
# Test Set
print(f"Test Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
Training Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:
Prediction Time: 0.463s
Accuracy Score: 0.9198592108346445
F1 Score: 0.014235294117647058
AUROC Score: 0.7377964403845755
Confusion Matrix:
Validation Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:
Prediction Time: 0.200s
Accuracy Score: 0.9164450905752052
F1 Score: 0.016216216216216217
AUROC Score: 0.7379073053753373
Confusion Matrix:
Test Set with Selected 22 Input Features + Added Debt-Income-Ratio Feature:
Prediction Time: 0.184s
Accuracy Score: 0.9190712597827736
F1 Score: 0.011649457241196717
AUROC Score: 0.7373531897169191
Confusion Matrix:
expLog.loc[len(expLog)] = [exp_count,
exp_name,
total_inputs,
len(all_num_features),  # NB: full 104/16 counts are logged here, though data_pipeline_DIR_17 uses only the 17 selected features
len(all_cat_features),
round(train_acc, 3),
round(valid_acc, 3),
round(test_acc, 3),
round(train_f1, 3),
round(valid_f1, 3),
round(test_f1, 3),
round(train_auroc, 3),
round(valid_auroc, 3),
round(test_auroc,3),
train_time,
pred_time_train,
pred_time_valid,
pred_time_test,
"N/A",
"N/A",
"N/A",
description]
display(expLog)
exp_count += 1
| | Experiment Number | Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Baseline 1, LogReg with Original 17 Selected F... | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.015 | ... | 0.738 | 0.737 | 2.9348 | 0.713401 | 0.438845 | 0.189358 | N/A | N/A | N/A | Baseline 1 LogReg Model with Preselected Num a... |
| 1 | 2 | Baseline 2, LogReg with original 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.022 | 0.020 | ... | 0.746 | 0.743 | 4.9481 | 1.188051 | 0.359550 | 0.321185 | N/A | N/A | N/A | Baseline 2 LogReg Model with Num and Cat Featu... |
| 2 | 3 | LogReg - L1 Penalty with Selected 17 Features | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 15.1447 | 0.467919 | 0.210027 | 0.190774 | N/A | N/A | N/A | LogReg Model-L1 Penalty with Selected 17 Cat +... |
| 3 | 4 | LogReg - L1 Penalty with 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.017 | 0.014 | ... | 0.745 | 0.743 | 60.4684 | 1.212351 | 0.351223 | 0.323352 | N/A | N/A | N/A | LogReg Model-L1 Penalty with 104 Num + 16 Cat ... |
| 4 | 5 | LogReg with Num and Cat Features + Debt_Income... | 251 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.021 | 0.018 | ... | 0.746 | 0.743 | 5.0762 | 1.155572 | 0.350079 | 0.325508 | N/A | N/A | N/A | Logistic Regression Model with Original 120 N... |
| 5 | 6 | LogReg with Num and Cat Features + Debt_Income... | 50 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 2.1173 | 0.463066 | 0.200407 | 0.184219 | N/A | N/A | N/A | Logistic Regression Model with Original 17 Nu... |
6 rows × 22 columns
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
clf_names = ["Random Forest",
# "SVC"
]
clfs = [RandomForestClassifier(n_jobs=-1, class_weight='balanced'),
# SVC()
]
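With roughly 92% of loans repaid (the accuracy scores above sit at the majority-class rate), `class_weight='balanced'` makes the forest weight each class inversely to its frequency, i.e. `n_samples / (n_classes * n_class_samples)`. A quick check of the weights this implies for an HCDR-like split (the 92/8 toy labels are an illustration, not the exact dataset ratio):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels mimicking the ~92/8 repaid/default split in the HCDR data
y = np.concatenate([np.zeros(92, dtype=int), np.ones(8, dtype=int)])

weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # ~0.543 for the majority class, 6.25 for defaults
```

Each default thus counts about 11.5 times as much as a repaid loan in the split criterion, which is what pushes the classifier away from always predicting the majority class.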
for clf_name, clf in zip(clf_names, clfs):
print("-----------------------------------------------------")
print(f"{clf_name.upper()}")
print("-----------------------------------------------------")
pipe = Pipeline([
("preparation", data_pipeline_17),
("clf", clf),
])
# Name of Experiment
exp_name = clf_name +" with 17 Features"
# Description of Experiment
description = f'{clf_name} Model with 10 Num + 7 Cat Features.'
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time= train_model(train, exp_name, selected_num_features, selected_cat_features, pipe)
# Training Set
print("Baseline Experiment with 17 Variables - Training Set:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')
# Validation Set
print("Baseline Experiment with 17 Variables - Validation Set:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')
# Test Set
print("Baseline Experiment with 17 Variables - Test Set:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
expLog.loc[len(expLog)] = [exp_count,
exp_name,
total_inputs_17,
len(selected_num_features),
len(selected_cat_features),
round(train_acc, 3),
round(valid_acc, 3),
round(test_acc,3),
round(train_f1, 3),
round(valid_f1, 3),
round(test_f1,3),
round(train_auroc, 3),
round(valid_auroc, 3),
round(test_auroc,3),
train_time,
pred_time_train,
pred_time_valid,
pred_time_test,
"N/A",
"N/A",
"N/A",
description]
exp_count += 1
-----------------------------------------------------
RANDOM FOREST
-----------------------------------------------------
X train shape: (209107, 17)
X validation shape: (52277, 17)
X test shape: (46127, 17)
PERFORMING TRAINING: {exp_name}
Pipeline: ['preparation', 'clf']
# Total Features: 17
Numerical Features:
['EXT_SOURCE_3', 'EXT_SOURCE_2', 'EXT_SOURCE_1', 'DAYS_BIRTH', 'REGION_RATING_CLIENT_W_CITY', 'DAYS_LAST_PHONE_CHANGE', 'DAYS_ID_PUBLISH', 'REG_CITY_NOT_WORK_CITY', 'AMT_INCOME_TOTAL', 'AMT_CREDIT']
# Numerical Features: 10
Categorical Features:
['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE']
# Categorical Features: 7
training in progress...
Baseline Experiment with Original 17 Input Variables - Training Time: 4.923s
Baseline Experiment with 17 Variables - Training Set:
Prediction Time: 0.840s
Accuracy Score: 0.9999713065559738
F1 Score: 0.9998207242739333
AUROC Score: 1.0
Confusion Matrix:
Baseline Experiment with 17 Variables - Validation Set:
Prediction Time: 0.325s
Accuracy Score: 0.9163685750903839
F1 Score: 0.004100227790432802
AUROC Score: 0.7195105509709813
Confusion Matrix:
Baseline Experiment with 17 Variables - Test Set:
Prediction Time: 0.295s
Accuracy Score: 0.9195265245951395
F1 Score: 0.006955591225254147
AUROC Score: 0.7205793058613987
Confusion Matrix:
display(expLog)
| | Experiment Number | Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Baseline 1, LogReg with Original 17 Selected F... | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.015 | ... | 0.738 | 0.737 | 2.9348 | 0.713401 | 0.438845 | 0.189358 | N/A | N/A | N/A | Baseline 1 LogReg Model with Preselected Num a... |
| 1 | 2 | Baseline 2, LogReg with original 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.022 | 0.020 | ... | 0.746 | 0.743 | 4.9481 | 1.188051 | 0.359550 | 0.321185 | N/A | N/A | N/A | Baseline 2 LogReg Model with Num and Cat Featu... |
| 2 | 3 | LogReg - L1 Penalty with Selected 17 Features | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 15.1447 | 0.467919 | 0.210027 | 0.190774 | N/A | N/A | N/A | LogReg Model-L1 Penalty with Selected 17 Cat +... |
| 3 | 4 | LogReg - L1 Penalty with 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.017 | 0.014 | ... | 0.745 | 0.743 | 60.4684 | 1.212351 | 0.351223 | 0.323352 | N/A | N/A | N/A | LogReg Model-L1 Penalty with 104 Num + 16 Cat ... |
| 4 | 5 | LogReg with Num and Cat Features + Debt_Income... | 251 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.021 | 0.018 | ... | 0.746 | 0.743 | 5.0762 | 1.155572 | 0.350079 | 0.325508 | N/A | N/A | N/A | Logistic Regression Model with Original 120 N... |
| 5 | 6 | LogReg with Num and Cat Features + Debt_Income... | 50 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 2.1173 | 0.463066 | 0.200407 | 0.184219 | N/A | N/A | N/A | Logistic Regression Model with Original 17 Nu... |
| 6 | 7 | Random Forest with 17 Features | 49 | 10 | 7 | 1.00 | 0.916 | 0.920 | 1.000 | 0.004 | ... | 0.720 | 0.721 | 4.9230 | 0.839898 | 0.325235 | 0.294644 | N/A | N/A | N/A | Random Forest Model with 10 Num + 7 Cat Features. |
7 rows × 22 columns
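Row 6 makes the overfitting of the default random forest explicit: train AUROC of 1.0 against ~0.72 on validation and test. Constraining tree growth is the standard remedy; the depth and leaf-size values below are illustrative guesses (not tuned for HCDR), shown on synthetic data so the sketch is self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative regularization: shallower trees and larger leaves stop the
# forest from memorizing individual training rows
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    min_samples_leaf=50,
    class_weight='balanced',
    n_jobs=-1,
    random_state=42,
)

# Synthetic imbalanced data standing in for the HCDR training split
X, y = make_classification(n_samples=2000, weights=[0.92], random_state=0)
rf.fit(X, y)
```

In practice these two parameters would go into the grid search below rather than being fixed by hand.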
# enable_hist_gradient_boosting is required in scikit-learn < 1.0 before the classifier import
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.base import BaseEstimator, TransformerMixin
class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert scipy sparse output (e.g. from one-hot encoding) to a dense array."""
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.toarray()
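`HistGradientBoostingClassifier` does not accept scipy sparse input, which is why a densifying step can be needed between the one-hot encoder and the classifier. A standalone check of the idea (the class is re-declared here so the snippet runs on its own; `toarray()` is preferred since `todense()` returns the deprecated `np.matrix` type):

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(BaseEstimator, TransformerMixin):
    """Densify a scipy sparse matrix mid-pipeline."""
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.toarray()

X_sparse = sparse.csr_matrix(np.eye(3))
X_dense = DenseTransformer().fit_transform(X_sparse)
print(type(X_dense).__name__, X_dense.shape)  # ndarray (3, 3)
```

Note the step is commented out in the 17-feature pipeline above because that run did not require it, but it is enabled for the 120-feature run below.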
clf_names = ["Gradboost",
# "SVC"
]
clfs = [HistGradientBoostingClassifier()
# SVC()
]
for clf_name, clf in zip(clf_names, clfs):
print("-----------------------------------------------------")
print(f"{clf_name.upper()}")
print("-----------------------------------------------------")
pipe = Pipeline([
("preparation", data_pipeline_17),
#("to_dense", DenseTransformer()),
("clf", clf)
])
# Name of Experiment
exp_name = clf_name +" with 17 Features"
# Description of Experiment
description = f'{clf_name} Model with 10 Num + 7 Cat Features.'
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, selected_num_features, selected_cat_features, pipe)
# Training Set
print("Baseline Experiment with 17 Variables - Training Set:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')
# Validation Set
print("Baseline Experiment with 17 Variables - Validation Set:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')
# Test Set
print("Baseline Experiment with 17 Variables - Test Set:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
expLog.loc[len(expLog)] = [exp_count,
exp_name,
total_inputs_17,  # transformed (one-hot expanded) feature count, as logged for the other 17-feature runs
len(selected_num_features),
len(selected_cat_features),
round(train_acc, 3),
round(valid_acc, 3),
round(test_acc,3),
round(train_f1, 3),
round(valid_f1, 3),
round(test_f1,3),
round(train_auroc, 3),
round(valid_auroc, 3),
round(test_auroc,3),
train_time,
pred_time_train,
pred_time_valid,
pred_time_test,
"N/A",
"N/A",
"N/A",
description]
exp_count += 1
display(expLog)
-----------------------------------------------------
GRADBOOST
-----------------------------------------------------
X train shape: (209107, 17)
X validation shape: (52277, 17)
X test shape: (46127, 17)
PERFORMING TRAINING: {exp_name}
Pipeline: ['preparation', 'clf']
# Total Features: 17
Numerical Features:
['EXT_SOURCE_3', 'EXT_SOURCE_2', 'EXT_SOURCE_1', 'DAYS_BIRTH', 'REGION_RATING_CLIENT_W_CITY', 'DAYS_LAST_PHONE_CHANGE', 'DAYS_ID_PUBLISH', 'REG_CITY_NOT_WORK_CITY', 'AMT_INCOME_TOTAL', 'AMT_CREDIT']
# Numerical Features: 10
Categorical Features:
['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE']
# Categorical Features: 7
training in progress...
Baseline Experiment with Original 17 Input Variables - Training Time: 1.821s
Baseline Experiment with 17 Variables - Training Set:
Prediction Time: 0.591s
Accuracy Score: 0.9207056674334193
F1 Score: 0.030974227105370813
AUROC Score: 0.7740722497994058
Confusion Matrix:
Baseline Experiment with 17 Variables - Validation Set:
Prediction Time: 0.261s
Accuracy Score: 0.9166555081584635
F1 Score: 0.021997755331088664
AUROC Score: 0.7464147311171788
Confusion Matrix:
Baseline Experiment with 17 Variables - Test Set:
Prediction Time: 0.235s
Accuracy Score: 0.9197433173629328
F1 Score: 0.024248813916710594
AUROC Score: 0.7458278199091248
Confusion Matrix:
| | Experiment Number | Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Baseline 1, LogReg with Original 17 Selected F... | 49 | 10 | 7 | 0.920 | 0.916 | 0.919 | 0.014 | 0.015 | ... | 0.738 | 0.737 | 2.9348 | 0.713401 | 0.438845 | 0.189358 | N/A | N/A | N/A | Baseline 1 LogReg Model with Preselected Num a... |
| 1 | 2 | Baseline 2, LogReg with original 120 Features | 250 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.022 | 0.020 | ... | 0.746 | 0.743 | 4.9481 | 1.188051 | 0.359550 | 0.321185 | N/A | N/A | N/A | Baseline 2 LogReg Model with Num and Cat Featu... |
| 2 | 3 | LogReg - L1 Penalty with Selected 17 Features | 49 | 10 | 7 | 0.920 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 15.1447 | 0.467919 | 0.210027 | 0.190774 | N/A | N/A | N/A | LogReg Model-L1 Penalty with Selected 17 Cat +... |
| 3 | 4 | LogReg - L1 Penalty with 120 Features | 250 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.017 | 0.014 | ... | 0.745 | 0.743 | 60.4684 | 1.212351 | 0.351223 | 0.323352 | N/A | N/A | N/A | LogReg Model-L1 Penalty with 104 Num + 16 Cat ... |
| 4 | 5 | LogReg with Num and Cat Features + Debt_Income... | 251 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.021 | 0.018 | ... | 0.746 | 0.743 | 5.0762 | 1.155572 | 0.350079 | 0.325508 | N/A | N/A | N/A | Logistic Regression Model with Original 120 N... |
| 5 | 6 | LogReg with Num and Cat Features + Debt_Income... | 50 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 2.1173 | 0.463066 | 0.200407 | 0.184219 | N/A | N/A | N/A | Logistic Regression Model with Original 17 Nu... |
| 6 | 7 | Random Forest with 17 Features | 49 | 10 | 7 | 1.000 | 0.916 | 0.920 | 1.000 | 0.004 | ... | 0.720 | 0.721 | 4.9230 | 0.839898 | 0.325235 | 0.294644 | N/A | N/A | N/A | Random Forest Model with 10 Num + 7 Cat Features. |
| 7 | 8 | Gradboost with 17 Features | 17 | 10 | 7 | 0.921 | 0.917 | 0.920 | 0.031 | 0.022 | ... | 0.746 | 0.746 | 1.8211 | 0.591207 | 0.261072 | 0.234816 | N/A | N/A | N/A | Gradboost Model with 10 Num + 7 Cat Features. |
8 rows × 22 columns
clf_names = ["Gradboost",
# "SVC"
]
clfs = [HistGradientBoostingClassifier()
# SVC()
]
for clf_name, clf in zip(clf_names, clfs):
print("-----------------------------------------------------")
print(f"{clf_name.upper()}")
print("-----------------------------------------------------")
pipe = Pipeline([
("preparation", data_pipeline_120),
("to_dense", DenseTransformer()),
("clf", clf)
])
# Name of Experiment
exp_name = clf_name +" with 120 Features"
# Description of Experiment
description = f'{clf_name} Model with 104 Num + 16 Cat Features.'
features, X_train, X_valid, X_test, y_train, y_valid, y_test, model, train_time = train_model(train, exp_name, all_num_features, all_cat_features, pipe)  # pass the gradient-boosting `pipe` built above, not the logistic-regression baseline
# Training Set
print("Baseline Experiment with 120 Variables - Training Set:")
cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = predict_and_score(X_train, y_train, model, exp_name+' - Training Set')
# Validation Set
print("Baseline Experiment with 120 Variables - Validation Set:")
cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = predict_and_score(X_valid, y_valid, model, exp_name+' - Validation Set')
# Test Set
print("Baseline Experiment with 120 Variables - Test Set:")
cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = predict_and_score(X_test, y_test, model, exp_name+' - Test Set')
expLog.loc[len(expLog)] = [exp_count,
exp_name,
len(features),            # original (pre-transform) feature count
len(all_num_features),    # 104 numerical features used in this experiment
len(all_cat_features),    # 16 categorical features used
round(train_acc, 3),
round(valid_acc, 3),
round(test_acc,3),
round(train_f1, 3),
round(valid_f1, 3),
round(test_f1,3),
round(train_auroc, 3),
round(valid_auroc, 3),
round(test_auroc,3),
train_time,
pred_time_train,
pred_time_valid,
pred_time_test,
"N/A",
"N/A",
"N/A",
description]
exp_count += 1
display(expLog)
-----------------------------------------------------
GRADBOOST
-----------------------------------------------------
X train shape: (209107, 120)
X validation shape: (52277, 120)
X test shape: (46127, 120)
PERFORMING TRAINING: {exp_name}
Pipeline: ['preparation', 'logRegression']
# Total Features: 120
Numerical Features: (same 104 features as listed for the earlier 120-feature experiments)
# Numerical Features: 104
Categorical Features: (same 16 features as listed for the earlier 120-feature experiments)
# Categorical Features: 16
training in progress...
Baseline Experiment with Original 120 Input Variables - Training Time: 5.113s
Baseline Experiment with 120 Variables - Training Set:
Prediction Time: 1.147s
Accuracy Score: 0.9199548556480653
F1 Score: 0.021512919443470127
AUROC Score: 0.745932186550156
Confusion Matrix:
Baseline Experiment with 120 Variables - Validation Set:
Prediction Time: 0.364s
Accuracy Score: 0.9163303173479733
F1 Score: 0.020161290322580648
AUROC Score: 0.7463851151424112
Confusion Matrix:
Baseline Experiment with 120 Variables - Test Set:
Prediction Time: 0.324s
Accuracy Score: 0.9193314111041255
F1 Score: 0.024127983215316024
AUROC Score: 0.7429677570702033
Confusion Matrix:
| | Experiment Number | Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Baseline 1, LogReg with Original 17 Selected F... | 49 | 10 | 7 | 0.920 | 0.916 | 0.919 | 0.014 | 0.015 | ... | 0.738 | 0.737 | 2.9348 | 0.713401 | 0.438845 | 0.189358 | N/A | N/A | N/A | Baseline 1 LogReg Model with Preselected Num a... |
| 1 | 2 | Baseline 2, LogReg with original 120 Features | 250 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.022 | 0.020 | ... | 0.746 | 0.743 | 4.9481 | 1.188051 | 0.359550 | 0.321185 | N/A | N/A | N/A | Baseline 2 LogReg Model with Num and Cat Featu... |
| 2 | 3 | LogReg - L1 Penalty with Selected 17 Features | 49 | 10 | 7 | 0.920 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 15.1447 | 0.467919 | 0.210027 | 0.190774 | N/A | N/A | N/A | LogReg Model-L1 Penalty with Selected 17 Cat +... |
| 3 | 4 | LogReg - L1 Penalty with 120 Features | 250 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.017 | 0.014 | ... | 0.745 | 0.743 | 60.4684 | 1.212351 | 0.351223 | 0.323352 | N/A | N/A | N/A | LogReg Model-L1 Penalty with 104 Num + 16 Cat ... |
| 4 | 5 | LogReg with Num and Cat Features + Debt_Income... | 251 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.021 | 0.018 | ... | 0.746 | 0.743 | 5.0762 | 1.155572 | 0.350079 | 0.325508 | N/A | N/A | N/A | Logistic Regression Model with Original 120 N... |
| 5 | 6 | LogReg with Num and Cat Features + Debt_Income... | 50 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 2.1173 | 0.463066 | 0.200407 | 0.184219 | N/A | N/A | N/A | Logistic Regression Model with Original 17 Nu... |
| 6 | 7 | Random Forest with 17 Features | 49 | 10 | 7 | 1.000 | 0.916 | 0.920 | 1.000 | 0.004 | ... | 0.720 | 0.721 | 4.9230 | 0.839898 | 0.325235 | 0.294644 | N/A | N/A | N/A | Random Forest Model with 10 Num + 7 Cat Features. |
| 7 | 8 | Gradboost with 17 Features | 17 | 10 | 7 | 0.921 | 0.917 | 0.920 | 0.031 | 0.022 | ... | 0.746 | 0.746 | 1.8211 | 0.591207 | 0.261072 | 0.234816 | N/A | N/A | N/A | Gradboost Model with 10 Num + 7 Cat Features. |
| 8 | 9 | Gradboost with 120 Features | 120 | 10 | 7 | 0.920 | 0.916 | 0.919 | 0.022 | 0.020 | ... | 0.746 | 0.743 | 5.1127 | 1.147315 | 0.364037 | 0.323979 | N/A | N/A | N/A | Gradboost Model with 104 Num + 16 Cat Features. |
9 rows × 22 columns
clf_best_parameters = {}
# Function to run GridSearchCV and log experiments
def gs_classifier(in_features, clf_name, clf, parameters, expCount):
    y = train['TARGET']
    X = train[in_features]
    total_selected_inputs = len(in_features)
    numerical_features = X.describe().columns.to_list()
    total_num_inputs = len(numerical_features)
    categorical_features = list(set(X.columns.to_list()) - set(numerical_features))
    total_cat_inputs = len(categorical_features)
    description = f'{clf_name} with {total_selected_inputs} input features'
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
    print(f"X train shape: {X_train.shape}")
    print(f"X validation shape: {X_valid.shape}")
    print(f"X test shape: {X_test.shape}")
    data_pipeline = ColumnTransformer(transformers=[
        ("num_pipeline", num_pipeline, numerical_features),
        ("cat_pipeline", cat_pipeline, categorical_features)],
        remainder='drop',
        n_jobs=-1)
    clf_pipeline = Pipeline([
        ("preparation", data_pipeline),  # combination of numerical and categorical subpipelines
        ("clf", clf)                     # classifier estimator being tuned
    ])
    gs = GridSearchCV(clf_pipeline,
                      parameters,
                      scoring=['f1', 'roc_auc'],
                      cv=3,
                      refit='roc_auc',
                      n_jobs=-1,
                      verbose=1)
    print("\nPERFORMING GRID SEARCH FOR {}...".format(clf_name.upper()))
    print("\tpipeline:", [name for name, _ in clf_pipeline.steps])
    print("\tparameters:", parameters)
    print()
    start = time()
    gs.fit(X_train, y_train)
    train_time = time() - start
    print("\tTraining Time: %0.3fs" % train_time)
    print()
    # Score the refit best estimator from this grid search, not a stale global model
    model = gs.best_estimator_
    # Training Set
    print(f"{clf_name} Training Set with {total_selected_inputs} Input Features:")
    cm_train, y_pred_train, pred_time_train, train_acc, train_f1, train_auroc = \
        predict_and_score(X_train, y_train, model, description + ' - Training Set')
    # Validation Set
    print(f"{clf_name} Validation Set with {total_selected_inputs} Input Features:")
    cm_valid, y_pred_valid, pred_time_valid, valid_acc, valid_f1, valid_auroc = \
        predict_and_score(X_valid, y_valid, model, description + ' - Validation Set')
    # Test Set
    print(f"{clf_name} Experiment Test Set with {total_selected_inputs} Input Features:")
    cm_test, y_pred_test, pred_time_test, test_acc, test_f1, test_auroc = \
        predict_and_score(X_test, y_test, model, description + ' - Test Set')
    print("\n\tBest score: %0.3f" % gs.best_score_)
    print("\tBest parameters set:")
    best_parameters = gs.best_estimator_.get_params()
    best_parameters_dict = {}
    for param_name in sorted(parameters.keys()):
        print("\t\t%s: %r" % (param_name, best_parameters[param_name]))
        best_parameters_dict[param_name] = best_parameters[param_name]
    clf_best_parameters[clf_name] = best_parameters_dict
    print()
    expLog.loc[len(expLog)] = [expCount,
                               clf_name,
                               total_selected_inputs,
                               total_num_inputs,
                               total_cat_inputs,
                               round(train_acc, 3),
                               round(valid_acc, 3),
                               round(test_acc, 3),
                               round(train_f1, 3),
                               round(valid_f1, 3),
                               round(test_f1, 3),
                               round(train_auroc, 3),
                               round(valid_auroc, 3),
                               round(test_auroc, 3),
                               train_time,
                               pred_time_train,
                               pred_time_valid,
                               pred_time_test,
                               parameters,
                               best_parameters_dict,
                               round(gs.best_score_, 3),
                               description]
# Grid Search over Preparation Pipeline and Classifiers
clf_names = ["Random Forest",
# "Logistic Regression",
# "SVC",
]
estimators = [RandomForestClassifier(),
# LogisticRegression(solver='saga'),
# SVC(),
]
param_grids = [{'clf__n_estimators': [300, 500],
                'clf__max_features': ['sqrt', 'log2', None]},
               # {'clf__C': [1.0, 10.0, 100.0, 1000.0, 10000.0],
               #  'clf__penalty': [None, 'l1', 'l2']},
               # {'clf__C': [0.001, 0.01, 0.1, 1.],
               #  'clf__kernel': ["linear", "poly", "rbf", "sigmoid"],
               #  'clf__gamma': ["scale", "auto"]}
               ]
selected_features = selected_num_features + selected_cat_features
expCount = 1
for clf_name, clf, parameters in zip(clf_names, estimators, param_grids):
    gs_classifier(selected_features, clf_name, clf, parameters, expCount)
    expCount += 1
X train shape: (209107, 17)
X validation shape: (52277, 17)
X test shape: (46127, 17)
PERFORMING GRID SEARCH FOR RANDOM FOREST...
pipeline: ['preparation', 'clf']
parameters: {'clf__n_estimators': [300, 500], 'clf__max_features': ['sqrt', 'log2', None]}
Fitting 3 folds for each of 6 candidates, totalling 18 fits
Training Time: 938.274s
Random Forest Training Set with 17 Input Features:
(Truncated traceback.) The run initially failed inside `predict_and_score` with `KeyError: "['CNT_CHILDREN', 'AMT_ANNUITY', ..., 'AMT_REQ_CREDIT_BUREAU_YEAR'] not in index"`: the call referenced a stale global `model` whose fitted ColumnTransformer expected the full 120-feature set, while the data here contains only the 17 selected features. Scoring `gs.best_estimator_` from the current grid search resolves the error.
display(expLog)
| Experiment Number | Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Baseline 1, LogReg with Original 17 Selected F... | 49 | 10 | 7 | 0.920 | 0.916 | 0.919 | 0.014 | 0.015 | ... | 0.738 | 0.737 | 2.9348 | 0.713401 | 0.438845 | 0.189358 | N/A | N/A | N/A | Baseline 1 LogReg Model with Preselected Num a... |
| 1 | 2 | Baseline 2, LogReg with original 120 Features | 250 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.022 | 0.020 | ... | 0.746 | 0.743 | 4.9481 | 1.188051 | 0.359550 | 0.321185 | N/A | N/A | N/A | Baseline 2 LogReg Model with Num and Cat Featu... |
| 2 | 3 | LogReg - L1 Penalty with Selected 17 Features | 49 | 10 | 7 | 0.920 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 15.1447 | 0.467919 | 0.210027 | 0.190774 | N/A | N/A | N/A | LogReg Model-L1 Penalty with Selected 17 Cat +... |
| 3 | 4 | LogReg - L1 Penalty with 120 Features | 250 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.017 | 0.014 | ... | 0.745 | 0.743 | 60.4684 | 1.212351 | 0.351223 | 0.323352 | N/A | N/A | N/A | LogReg Model-L1 Penalty with 104 Num + 16 Cat ... |
| 4 | 5 | LogReg with Num and Cat Features + Debt_Income... | 251 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.021 | 0.018 | ... | 0.746 | 0.743 | 5.0762 | 1.155572 | 0.350079 | 0.325508 | N/A | N/A | N/A | Logistic Regression Model with Original 120 N... |
| 5 | 6 | LogReg with Num and Cat Features + Debt_Income... | 50 | 104 | 16 | 0.920 | 0.916 | 0.919 | 0.014 | 0.016 | ... | 0.738 | 0.737 | 2.1173 | 0.463066 | 0.200407 | 0.184219 | N/A | N/A | N/A | Logistic Regression Model with Original 17 Nu... |
| 6 | 7 | Random Forest with 17 Features | 49 | 10 | 7 | 1.000 | 0.916 | 0.920 | 1.000 | 0.004 | ... | 0.720 | 0.721 | 4.9230 | 0.839898 | 0.325235 | 0.294644 | N/A | N/A | N/A | Random Forest Model with 10 Num + 7 Cat Features. |
| 7 | 8 | Gradboost with 17 Features | 17 | 10 | 7 | 0.921 | 0.917 | 0.920 | 0.031 | 0.022 | ... | 0.746 | 0.746 | 1.8211 | 0.591207 | 0.261072 | 0.234816 | N/A | N/A | N/A | Gradboost Model with 10 Num + 7 Cat Features. |
| 8 | 9 | Gradboost with 120 Features | 120 | 10 | 7 | 0.920 | 0.916 | 0.919 | 0.022 | 0.020 | ... | 0.746 | 0.743 | 5.1127 | 1.147315 | 0.364037 | 0.323979 | N/A | N/A | N/A | Gradboost Model with 104 Num + 16 Cat Features. |
9 rows × 22 columns
# Function to build bar charts of scores for all models
acc_df = expLog[['Model', 'Train Acc', 'Valid Acc', 'Test Acc']].copy()
F1_df = expLog[['Model', 'Train F1', 'Valid F1', 'Test F1']].copy()
AUROC_df = expLog[['Model', 'Train AUROC', 'Valid AUROC', 'Test AUROC']].copy()

def score_barchart(df, title):
    # Plot the bar chart
    df.set_index('Model', inplace=True)
    ax = df.plot(kind='bar', figsize=(10, 6))
    plt.title(f'{title} Score Comparison')
    plt.ylabel(title)
    plt.xticks(rotation=90)
    plt.show()
test_class_scores = model.predict_proba(X_kaggle_test)[:, 1]
test_class_scores[0:10]
array([0.06091203, 0.2319424 , 0.03704471, 0.03813419, 0.1346362 ,
0.03247158, 0.02453353, 0.09834088, 0.01280437, 0.15194608])
# Submission dataframe (copy to avoid mutating the original application_test frame)
submit_df = datasets["application_test"][['SK_ID_CURR']].copy()
submit_df['TARGET'] = test_class_scores
submit_df.head()
| SK_ID_CURR | TARGET | |
|---|---|---|
| 0 | 100001 | 0.060912 |
| 1 | 100005 | 0.231942 |
| 2 | 100013 | 0.037045 |
| 3 | 100028 | 0.038134 |
| 4 | 100038 | 0.134636 |
submit_df.to_csv("submission.csv",index=False)
Team: GroupN_HCDR_1
Team Members:
Phase Leadership Plan
| Phase | Project Manager |
|---|---|
| 1 | Jacob |
| 2 | Leona |
| 3 | Olga |
| 4 | Nimish |
Credit Assignment Plan
Overview:
| Team and Plan Updates | Presentation Slides | Abstract | Project Description (Data and Tasks) | EDA | Visual EDA | Modeling Pipelines | Results and Discussion of Results | Conclusion |
|---|---|---|---|---|---|---|---|---|
| Leona | Leona | Leona | Nimish | Jacob | Jacob | Olga | Nimish | Leona |
Tasks:
| Who | What | Time |
|---|---|---|
| Leona | Create the phase2 notebook and update the Phase Leader Plan and Credit Assignment Plan | 30 minutes |
| Nimish | Describe the HCDR dataset, identify the tasks to be tackled, and provide diagrams to aid understanding of the workflow | 1 hour |
| Jacob | Run exploratory data analysis, including a data dictionary of the raw features, dataset size, summary statistics, correlation analysis, and other text-based analysis | 1.5 hours |
| Jacob | Run visual exploratory data analysis, including a visualization of each input and target feature, a visualization of the correlation analysis, pair-wise visualizations of the input and output features, a graphic summary of the missing-value analysis, etc. | 1.5 hours |
| Olga | Create a visualization of the modeling pipelines/subpipelines and identify the families of input features, the count per family, and the total number of input features | 2 hours |
| Olga | Record an experiment log with details including the baseline experiment, families of input features used, accuracy scores, and AUC/ROC scores | 1 hour |
| Leona | After all other work is complete, create the abstract and conclusion to summarize at a high level the other work, and what the project will be | 1 hour |
| Leona | Create slides for the group's video presentation based on everyone's collective work | 1 hour |
| Leona | Review all work (including abstract and conclusion) and ensure professional appearance for the entirety of the phase2 notebook | 1 hour |
| All | Record a 2-minute video presentation about the project and our findings. The video will have a logical and scientific flow to it. | 20 minutes |
In this project, our goal is to create a machine learning (ML) model that predicts the likelihood of a borrower defaulting on their credit using unconventional data sources. The aim is to provide lenders with a model that maximizes profit while minimizing risk. In Phase 3 of the project, we tackled the problem of predicting credit default with a focus on feature engineering and hyperparameter tuning; our main goal was to compare different models and identify the best score for this task. We conducted several experiments using different models and feature sets. The baseline experiments involved logistic regression models with 17 selected features and with the 120 original input features; these models achieved high accuracy and F1 scores. Next, we explored an L1 penalty in the logistic regression models with selected features and observed results comparable to the baselines, suggesting that the L1 penalty did not improve performance. We also experimented with random forest and gradient boosting models using the selected features. Our observations suggest that gradient boosting with the 17 selected features performs better than the other models: it achieved the highest accuracy, F1 score, and AUROC on both the validation and test sets.
The pipelines used in this project increase the efficiency and readability of our code. Our most basic pipelines (Level 3 pipelines) prepare the selected input feature data: numerical and categorical features are each handled in their own pipeline, with numerical data standardized and categorical data one-hot encoded; both subpipelines also impute missing values. The Level 2 pipeline is a column transformer that combines the numerical and categorical subpipelines to streamline data preparation before it reaches the classifier. Lastly, the Level 1 pipeline joins the Level 2 data preparation pipeline to the classifier model.
Our baseline pipelines:
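The three pipeline levels described above can be sketched as follows. This is a minimal illustration, not our full preparation pipeline: the toy data and the two column names used here stand in for the real feature lists.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Level 3: per-type subpipelines (impute, then scale or encode)
num_pipeline = Pipeline([("imputer", SimpleImputer(strategy="median")),
                         ("scaler", StandardScaler())])
cat_pipeline = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                         ("onehot", OneHotEncoder(handle_unknown="ignore"))])

# Level 2: column transformer routing each feature family to its subpipeline
data_pipeline = ColumnTransformer([("num", num_pipeline, ["AMT_CREDIT"]),
                                   ("cat", cat_pipeline, ["NAME_CONTRACT_TYPE"])])

# Level 1: data preparation plus classifier as one estimator
clf_pipeline = Pipeline([("preparation", data_pipeline),
                         ("clf", LogisticRegression())])

# Toy data with a missing value to show that imputation is handled inside the pipeline
toy = pd.DataFrame({"AMT_CREDIT": [100000.0, 250000.0, None],
                    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"]})
clf_pipeline.fit(toy, [0, 1, 0])
print(clf_pipeline.predict(toy))
```

Because the full chain is a single estimator, `GridSearchCV` can cross-validate preparation and classifier together, as in `gs_classifier` above.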
We also ran experiments using logistic regression with a newly engineered feature, Debt-to-Income Ratio, built from the application_train features 'AMT_CREDIT' and 'AMT_INCOME_TOTAL'. Debt-to-Income Ratio is a good measure of the ability to repay loans, showing how much of a person's income goes toward paying debt. According to Wells Fargo (https://www.wellsfargo.com/goals-credit/smarter-credit/credit-101/debt-to-income-ratio/understanding-dti/), a Debt-to-Income Ratio of 35% or less is considered good, 36-49% shows room for improvement, and 50% or more shows a need to take action.
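A minimal sketch of this feature engineering step, using made-up application values; the band labels encode the Wells Fargo guidance above:

```python
import pandas as pd

# Toy applications; the values are illustrative, the columns match application_train
apps = pd.DataFrame({"AMT_CREDIT": [200000.0, 90000.0, 300000.0],
                     "AMT_INCOME_TOTAL": [600000.0, 200000.0, 450000.0]})

# Engineered feature: credit amount relative to total income
apps["DEBT_INCOME_RATIO"] = apps["AMT_CREDIT"] / apps["AMT_INCOME_TOTAL"]

# Bands per the Wells Fargo guidance: <=35% good, 36-49% improve, >=50% take action
apps["DTI_BAND"] = pd.cut(apps["DEBT_INCOME_RATIO"],
                          bins=[0, 0.35, 0.50, float("inf")],
                          labels=["good", "room for improvement", "take action"])
print(apps)
```

In the experiments themselves only the numeric `DEBT_INCOME_RATIO` column is appended to the input features; the band column is just for interpretation.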
We also ran different experiments using different algorithms such as a Random Forest model and Gradient Boosting Classifier.
expLog.set_index('Experiment Number', inplace=True)
display(expLog)
| Model | # Transformed Input Features | # Original Numerical Features | # Original Categorical Features | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | Test F1 | ... | Valid AUROC | Test AUROC | Training Time | Training Prediction Time | Validation Prediction Time | Test Prediction Time | Hyperparameters | Best Parameter | Best Hypertuning Score | Description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Experiment Number | |||||||||||||||||||||
| 1 | Baseline 1, LogReg with Original 17 Selected F... | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.015 | 0.011 | ... | 0.738 | 0.737 | 5.7270 | 0.616539 | 0.235517 | 0.228772 | N/A | N/A | N/A | Baseline 1 LogReg Model with Preselected Num a... |
| 2 | Baseline 2, LogReg with original 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.019 | 0.018 | 0.021 | ... | 0.746 | 0.743 | 10.8317 | 4.910362 | 0.653457 | 0.585086 | N/A | N/A | N/A | Baseline 2 LogReg Model with Num and Cat Featu... |
| 3 | LogReg - L1 Penalty with Selected 17 Features | 49 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.014 | 0.016 | 0.011 | ... | 0.738 | 0.737 | 21.0944 | 0.607865 | 0.234333 | 0.215436 | N/A | N/A | N/A | LogReg Model-L1 Penalty with Selected 17 Cat +... |
| 4 | LogReg - L1 Penalty with 120 Features | 250 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.017 | 0.014 | 0.018 | ... | 0.745 | 0.743 | 82.6300 | 1.839408 | 0.539958 | 0.517283 | N/A | N/A | N/A | LogReg Model-L1 Penalty with 104 Num + 16 Cat ... |
| 5 | LogReg with Num and Cat Features + Debt_Income... | 251 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.021 | 0.018 | 0.023 | ... | 0.746 | 0.743 | 10.3708 | 2.619332 | 0.641288 | 0.518304 | N/A | N/A | N/A | Logistic Regression Model with Original 120 N... |
| 6 | LogReg with Num and Cat Features + Debt_Income... | 50 | 104 | 16 | 0.92 | 0.916 | 0.919 | 0.014 | 0.016 | 0.012 | ... | 0.738 | 0.737 | 3.9352 | 0.593195 | 0.238655 | 0.220598 | N/A | N/A | N/A | Logistic Regression Model with Original 17 Nu... |
| 7 | Random Forest with 17 Features | 49 | 10 | 7 | 1.00 | 0.916 | 0.920 | 1.000 | 0.005 | 0.007 | ... | 0.721 | 0.724 | 14.7829 | 3.057214 | 0.778230 | 0.787176 | N/A | N/A | N/A | Random Forest Model with 10 Num + 7 Cat Features. |
| 8 | Gradboost with 17 Features | 17 | 10 | 7 | 0.92 | 0.916 | 0.920 | 0.023 | 0.014 | 0.022 | ... | 0.747 | 0.748 | 4.0657 | 1.022855 | 0.341756 | 0.315761 | N/A | N/A | N/A | Gradboost Model with 10 Num + 7 Cat Features. |
| 9 | Gradboost with 120 Features | 120 | 10 | 7 | 0.92 | 0.916 | 0.919 | 0.019 | 0.018 | 0.021 | ... | 0.746 | 0.743 | 9.2200 | 2.100374 | 0.598318 | 0.507419 | N/A | N/A | N/A | Gradboost Model with 104 Num + 16 Cat Features. |
9 rows × 21 columns
Logistic function
$$ \sigma(t) = \dfrac{1}{1 + \exp(-t)} $$
Logistic Regression model prediction
$$ \hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5, \\ 1 & \text{if } \hat{p} \geq 0.5. \end{cases} $$
Cost function of a single training instance
$$ c(\boldsymbol{\theta}) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1, \\ -\log(1 - \hat{p}) & \text{if } y = 0. \end{cases} $$
Binary Cross-Entropy Loss (CXE)
Binary Cross-Entropy loss, aka log loss, is a special case of negative log likelihood. It measures a classifier's performance and increases as the predicted probability moves farther from the true label. The goal in logistic regression is to minimize the CXE. $$ J(\boldsymbol{\theta}) = -\dfrac{1}{m} \sum\limits_{i=1}^{m}{\left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) \log\left(1 - \hat{p}^{(i)}\right)\right]} $$
LASSO Binary Cross-Entropy (LBXE) $$ J(\boldsymbol{\theta}) = -\dfrac{1}{m} \sum\limits_{i=1}^{m}{\left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) \log\left(1 - \hat{p}^{(i)}\right)\right]} + \lambda \sum_{j=1}^{n}|w_j| $$
Ridge Binary Cross-Entropy (RBXE) $$ J(\boldsymbol{\theta}) = -\dfrac{1}{m} \sum\limits_{i=1}^{m}{\left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) \log\left(1 - \hat{p}^{(i)}\right)\right]} + \lambda \sum_{j=1}^{n}w_j^2 $$
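The CXE formula above can be checked numerically. This small sketch (with made-up probabilities) shows that confident, correct predictions give a low loss while uninformative predictions of 0.5 give a loss of $\log 2$:

```python
import numpy as np

def binary_cross_entropy(y_true, p_hat):
    """Mean negative log likelihood of Bernoulli labels, matching J(theta) above."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

# Confident, mostly correct predictions give a small loss ...
print(binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8]))
# ... and an uninformative classifier predicting 0.5 everywhere gives log(2) ~ 0.693
print(binary_cross_entropy([1, 0, 1], [0.5, 0.5, 0.5]))
```

The LASSO and Ridge variants simply add $\lambda \sum_j |w_j|$ or $\lambda \sum_j w_j^2$ to this value.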
Primal Soft Margin SVM Classifier $$ {\displaystyle \underset{W,b, \zeta}{\text{argmin }}{\overbrace{\dfrac{1}{2}}^A \underbrace{\mathbf{w}^T \cdot \mathbf{w}}_B \quad + }C\sum _{i=1}^{m}\zeta _{i}} $$
score_barchart(acc_df, "Accuracy")
$$
F_1 = \cfrac{2}{\cfrac{1}{\text{precision}} + \cfrac{1}{\text{recall}}} = 2 \times \cfrac{\text{precision}\, \times \, \text{recall}}{\text{precision}\, + \, \text{recall}} = \cfrac{TP}{TP + \cfrac{FN + FP}{2}}
$$
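The three forms of the $F_1$ identity above agree, which is easy to verify on hypothetical confusion-matrix counts:

```python
def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of confusion-matrix counts (rightmost form above)."""
    return tp / (tp + (fn + fp) / 2)

# Hypothetical counts
tp, fp, fn = 30, 10, 20
precision = tp / (tp + fp)   # 0.75
recall = tp / (tp + fn)      # 0.6
harmonic = 2 * precision * recall / (precision + recall)
print(harmonic, f1_from_counts(tp, fp, fn))  # both forms give the same F1
```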
score_barchart(F1_df, "F1")
$$ \text{Specificity} = \cfrac{TN}{TN + FP} $$
$$ \text{FPR = 1 - Specificity} = \cfrac{FP}{TN + FP} $$
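Since specificity and FPR share the denominator $TN + FP$, they always sum to 1; a quick check on hypothetical counts:

```python
def specificity(tn, fp):
    # Fraction of actual negatives that are correctly classified
    return tn / (tn + fp)

def false_positive_rate(tn, fp):
    # Fraction of actual negatives that are misclassified as positive
    return fp / (tn + fp)

tn, fp = 90, 10  # hypothetical confusion-matrix counts
print(specificity(tn, fp), false_positive_rate(tn, fp))  # 0.9 0.1
```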
score_barchart(AUROC_df, "AUROC")
In Phase 3 of the HCDR project, we used Home Credit's extensive dataset to build models that accurately predict whether a client with minimal credit history will repay a loan. This real-world problem is highly relevant today in a society of rapidly growing wealth disparities. Using machine learning pipelines to preprocess and transform the input features, we built several models that were evaluated by three performance metrics: accuracy, F1 score, and AUROC.
References: